Today I’d like to introduce you to an algorithm I have stumbled over twice already and that I really like because of the simple idea it uses to address the non-trivial problem of quantifying the similarity of two strings: the Levenshtein algorithm.
The algorithm determines the so-called Levenshtein distance of two strings, which is a natural number. It’s an example of the mathematical technique of Dynamic Programming and was invented by the Russian mathematician Vladimir Levenshtein back in 1965. The algorithm is used by some major companies (e.g. Yahoo!) in production environments to this day.
The core idea is to measure similarity by the minimal number of elementary transformations that are needed to turn an arbitrary string S into another string, say T. Here, elementary transformations are defined as precisely three operations: the insertion of a single character, the deletion of a single character, and the substitution of one character by another.
Mathematically it works with a very elegant trick: compute the Levenshtein distances of trivial substrings first (which is very easy), then successively compute the distances of more complex substrings using only previously evaluated values.
The mathematical formalism is well explained on Wikipedia and other resources, so I’d like to focus on a walk-through example now. Let’s compute the Levenshtein distance of the words cat and cute.
Note that the empty word is denoted by ε. In practice the algorithm fills a table (a matrix) whose (i,j)-th entry contains the Levenshtein distance¹ of the prefix of length i of the first word and the prefix of length j of the second word.
We emphasize the (technical) convention that every word starts with the empty word, so cat can be written as εcat. Therefore the first entry with indices (0,0) is the Levenshtein distance of the 0-prefixes of εcat and εcute, i.e. the distance of the empty word ε to itself, which is simply zero. Furthermore, to construct any word of length i starting from the empty word we simply need i insertions. With this in mind it is easy to write down the first row immediately:
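```
      ε  c  u  t  e
  ε   0  1  2  3  4
```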
This translates as: It takes one elementary transformation to turn ε into εc (an insertion), two transformations to turn ε into εca and so on. The first column can be obtained immediately with the same strategy:
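```
      ε
  ε   0
  c   1
  a   2
  t   3
```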
Let’s combine them. Our recurrence matrix D now has the form:
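```
      ε  c  u  t  e
  ε   0  1  2  3  4
  c   1  ?  ?  ?  ?
  a   2  ?  ?  ?  ?
  t   3  ?  ?  ?  ?
```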
All we have to do is fill the question-mark entries using only the previously computed values. This is the point where we need to inspect the mathematical rules the algorithm postulates. Suppose we want to obtain the matrix entry (i, j) and remember that this corresponds to the prefix of length i of the word εcat and the prefix of length j of the word εcute:
When the i-th character of the first word S exactly matches the j-th character of the second word T, we have the optimal case. No “operation” is required; the total costs are the same as the distance of the (i-1)-prefix and the (j-1)-prefix of the words. So we take the matrix entry (i-1, j-1) and put it into (i, j) as well. Otherwise the characters do not match and we have to execute one of the elementary operations. But which one? Here’s the answer:
The insertion costs D(i, j-1), the deletion costs D(i-1, j) and the substitution costs D(i-1, j-1) are nothing else than the values of neighboring matrix entries we have already computed! So have a look at the neighbors, choose the one with minimal costs and add +1 to it, because we need exactly one more operation compared to that “minimal” neighbor. Notice that in doing so we choose a kind of optimal path.
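Condensed into a formula (using the unit costs from footnote 1), the recurrence reads:

$$
D(i,j) =
\begin{cases}
D(i-1,\,j-1) & \text{if } S_i = T_j,\\
1 + \min\{\,D(i-1,\,j),\; D(i,\,j-1),\; D(i-1,\,j-1)\,\} & \text{otherwise,}
\end{cases}
$$

with the base cases $D(i,0) = i$ and $D(0,j) = j$.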
Let’s see how it works for our example matrix above. For the entry (1,1) we have the optimal case: a c is appended on both sides (starting from ε), and since the same character is added on both sides, the Levenshtein distance stays zero, just as it was for the upper-left neighbor (0,0). This matches the intuition if you recapitulate that entry (1,1) means the distance between εc and εc, i.e. between a string and itself.
Now the situation gets more interesting. Let’s look at the entry (2, 1), which corresponds to the distance of εca and εc. The characters a and c do not match, so the algorithm postulates looking for the neighbor with minimal cost. The already-computed neighbors have the values 2, 1 and 0; the minimal one is 0. We take this minimal value and add +1 to it to obtain the value 1 for (2, 1). This procedure is repeated successively for each remaining question mark. If we do this consistently we obtain:
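```
      ε  c  u  t  e
  ε   0  1  2  3  4
  c   1  0  1  2  3
  a   2  1  1  2  3
  t   3  2  2  1  2
```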
… and finally the matrix entry (m, n), where m is the length of the first string and n is the length of the second string, is defined as the Levenshtein distance of the two strings. So we have determined the Levenshtein distance of cat and cute: 2.
When I considered whether the algorithm would be a good choice for our specific problem, I immediately started hacking together some naive lines in my editor. I knew there were several stable implementations out there and my solution most probably wouldn’t add any new feature nor bring significant performance improvements, but I wanted to go through the algorithm on my own, step by step. So basically I started reimplementing it. But why reinvent the wheel?
As developers we’re used to the common folklore law of You-simply-should-not-reinvent-the-wheel. This is absolutely valid when it comes to the question of what to use in production environments (think of cryptography!). But if you’re interested in extending your toolset of concepts, the answer is different: rebuilding the algorithm on your own, independent of its difficulty, will teach you more than copy & paste. It’s the same phenomenon as in school: transcribing something from the blackboard yourself pays off way more than simply photocopying your neighbor’s notes 😉 .
By the way, there is no specific advantage in choosing Ruby here. Feel free to use any Turing-complete programming language you like and do the same. Of course there already are plenty of Ruby variants of the Levenshtein algorithm; for example, I found a very minimalistic implementation here:
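I won’t walk through that exact snippet, but to give you an idea, a typical compact variant built around a single costs array looks roughly like this (my paraphrase, not the code I found):

```ruby
# Compact Levenshtein distance with one reusable costs row.
# My paraphrase of the kind of minimal snippet I mean, not the original.
def levenshtein(s, t)
  costs = (0..t.length).to_a            # distances from the empty prefix of s
  (1..s.length).each do |i|
    upper_left = costs[0]               # D(i-1, j-1) for j = 1
    costs[0] = i                        # D(i, 0) = i
    (1..t.length).each do |j|
      upper = costs[j]                  # D(i-1, j)
      costs[j] = if s[i - 1] == t[j - 1]
                   upper_left           # characters match: no extra operation
                 else
                   1 + [costs[j - 1], upper, upper_left].min
                 end
      upper_left = upper
    end
  end
  costs.last
end

levenshtein('cat', 'cute') # => 2
```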
It’s pretty minified², but therefore hard to read when you want to learn how to apply the algorithm yourself. I felt the urge to give it more structure (and thereby to bloat it intentionally). The first object I focused on was the so-called recurrence matrix, usually called D, which is represented by the costs Array in the minimalistic version. This is the table we filled in the example above. To me it felt like a container data type in its own right (in the fashion of a monad), a somewhat supercharged Array. So I defined a dedicated class for it:
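A minimal sketch of what such a class can look like (constructor arguments and internals are illustrative, not my exact code):

```ruby
# Sketch of a dedicated container for the recurrence matrix D.
class RecurrenceMatrix
  def initialize(m, n)
    @entries = Array.new(m + 1) { Array.new(n + 1, 0) }
    # Pre-fill the trivially known values: turning the empty word
    # into a prefix of length i takes exactly i insertions.
    (0..m).each { |i| @entries[i][0] = i }
    (0..n).each { |j| @entries[0][j] = j }
  end

  # Expose rows so entries can be addressed as d[i][j].
  def [](i)
    @entries[i]
  end
end

RecurrenceMatrix.new(3, 4)[0] # => [0, 1, 2, 3, 4]
```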
The crucial point here is that the RecurrenceMatrix automatically fills the entries that are initially known. This container class may look overengineered here, but note that one can explicitly say RecurrenceMatrix.new(…), which I really like because it says without any comment what it is, what its purpose is and how to use it (“Matrix” in the class name implicitly hints to the user to make use of the operator :[], at least this is what I would expect).
Then I thought about what the outer API of my fictional class LevenshteinDistance could look like. The Levenshtein distance is a natural number³, so what I had in mind is:
```ruby
result = LevenshteinDistance.new('cat', 'cute').to_i
```
And the to_i method glues the successive computations together and returns the last value computed. Here, d is a recurrence matrix as reasoned above.
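A sketch of what this can look like (the instance variable @d and the helper costs_for are illustrative names, not necessarily my exact code):

```ruby
# Hypothetical skeleton matching the API above.
class LevenshteinDistance
  def initialize(s, t)
    @s, @t = s, t
    @d = RecurrenceMatrix.new(s.length, t.length) # d from the text
  end

  def to_i
    (1..@s.length).each do |i|
      (1..@t.length).each do |j|
        @d[i][j] = costs_for(i, j)
      end
    end
    @d[@s.length][@t.length] # the last value computed is the distance
  end
end
```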
The only item missing is the computation of the costs for each matrix entry. Here’s what I did based on the mathematical formulation:
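Modulo naming, it boiled down to something like this (again a sketch; costs_for is the hypothetical helper from above):

```ruby
class LevenshteinDistance
  private

  # The cost of entry (i, j), following the rules from the walk-through.
  def costs_for(i, j)
    if @s[i - 1] == @t[j - 1]
      @d[i - 1][j - 1]        # characters match: no extra operation
    else
      1 + [
        @d[i - 1][j],         # deletion
        @d[i][j - 1],         # insertion
        @d[i - 1][j - 1]      # substitution
      ].min
    end
  end
end
```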
One could take this one step further and move all the computation logic into the RecurrenceMatrix.
But altogether the state above felt fine to me and made me ready to switch to a more performant implementation relying on C instead :)
Note that my implementation works well for strings that aren’t too long, say below 500 characters. In my heuristically flavoured (and therefore non-scientific) benchmark on my local machine it took 1.28 seconds to compute the Levenshtein distance of strings of length 1000 and ~30 seconds for strings of length 10000. Theoretically the algorithm itself is of complexity O(mn), where m and n are the lengths of the input strings. So in conclusion, to compare really large strings you may want to use a Ruby gem that relies on native C extensions or similar.
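If you want to get a rough feeling for the runtime yourself, a few lines with Ruby’s Benchmark module (not my exact setup) are enough:

```ruby
require 'benchmark'

s = 'a' * 1_000
t = 'b' * 1_000

# Rough wall-clock timing; results of course depend on your machine.
puts Benchmark.realtime { LevenshteinDistance.new(s, t).to_i }
```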
I hope you now feel encouraged to play with algorithms or other concepts from academia that are unknown to you!
¹) I implicitly defined the cost of each operation as +1, as is the default. This can of course be modified, but that is beyond the scope of this write-up. Just keep in mind that this is a feature that can be adjusted.
²) … which is totally okay, because it was intended as a short and concise implementation there!
³) It’s completely off topic, but an interesting philosophical question indeed: whether zero should be treated as a natural number or not 😉