Today I’d like to introduce you to an algorithm I stumbled over already tiwce
and that I really like because of its simple idea to address the non-trivial
problem to *quantify* the similarity of two strings: The Levenshtein algorithm.

The algorithm determines the so-called Levenshtein distance of two strings
which is a natural number. It’s an example of the mathematical technique of
*Dynamic Programming*
and was invented by the Russian mathematician Vladimir Levenshtein already
in 1965. The algorithm is used by some major companies (e.g. Yahoo!) in
production environments until today.

The core idea is to measure similarity by the minimal number of *elementary transformations*
that are needed to turn an arbitrary string *S* into another string, say *T*. Here,
*elementary transformations* are defined as precisely three operations:

- a simple insertion of a single character to the string
- a deletion of a single character of the string
- a substitution of a single character by another

Mathematically it works with a very elegant trick of computing the Levenshtein distances of trivial substrings first (which is very easy) and then successively computing distances of more complex substrings by only using previously evaluated values.

The mathematical formalism is well-explained at wikipedia and other resources
and I’d like to focus on a walk-through example now. Let’s compute the Levenshtein distance of the words *cat* and *cute*.

Note that the empty word is denoted by *ε*. Practically the algorithm is filling a table (a matrix) where the (i,j)-th entry will
contain the Levenshtein distance¹ of the prefix of length *i* of the first word and the prefix of the prefix of length *j* of the
second word.

We emphasize the (technical) convention that every word starts with the empty word, so *cat* can be written as *εcat*.
Therefore the first entry with indices (0,0) is the Levenshtein distance of the *0*-prefix of *εcat* and *εcute*, which is the
distance of the empty word *ε* with itself - which is simply zero. Furthermore, to construct any word of length *i*
starting with the empty word we simply need *i* insertions. With this in mind it is
easy to write down the first row immediately:

This translates as: It takes one elementary transformation to turn *ε* into *εc* (an insertion), two transformations
to turn *ε* into *εca* and so on. The first column can be obtained immediately with the same strategy:

Let’s combine them. Our recurrence matrix *D* has now the form:

All we have to do is to fill the question mark entries by using only the previously computed values. This
is the point were we need to inspect the mathematical rules the algorithm postulates. Suppose we want to obtain the matrix entry (i, j) and
remember that this corresponds to the prefix of length *i* of the word *εcat* and prefix of length *j* of the word *εcute*:

When the *i*-ith character of the first word *S* exactly matches the *j*-th character of the second word *T* we have
the optimal case. No “operation” is required, the global costs are the same as for the distance of the (i-1)-th and the (j-1)-th
distances of the words, that is in that case. So we take the matrix entry *(i-1, j-1)* and put it into *(i, j)* as well.
Otherwise the characters do not match. We have execute one of the elementary operations. But which one? Here’s the answer:

The `insertion costs`

, `deletion costs`

and `substitution costs`

are nothing else
than values of neighbor matrix entries we already have computed! So have a look to the neighbors,
choose the one with mininal costs and add *+1* to its costs, because we need exactly one more operation
compared to that “minimal” neighbor. Notice that here we have chosen some kind of optimal *path*.

Let’s see how it works for our example matrix above. For the entry (1,1) we have the optimal case:
A *c* is added for both sides (when starting with *ε*) and since the same character is added on both sides
the Levenshtein distance stays zero because it was for the upper-left neighbor. And it matches the
intuition if you recapitulate that entry (1,1) means the distance between *εc* and *εc*, which still
is the same string.

Now the situation gets more interesting. Let’s focus on the second row, where two question
marks are left. The more left question mark is the entry *(2, 1)* and corresponds to the
distance of *εca* and *εc*. The algorithm now postulates looking for the minimal distance
relative to previously computed substrings. The already-computed neighbors have values *2*,
*1* and *0*. The mininal one is *0*. We now take the minimal value (0) and add +1 to it to obtain
the value for *(2, 1)*. This procedure is repeated successively for each question mark left. If we
consequently do this we obtain:

… and finally the matrix entry (m, n) where *m* is the length of the first string and *n* is the length of the second string
is defined as the Levenshtein distance of the two strings. So we have determinded the Levenshtein distance of *cat* and
*cute*: *2*.

When I reasoned if the algorithm would be good choice for our specific problem I immediately started hacking together some naive lines in my editor. I knew there were several stable implementations out there and my solution most probably wouldn’t add any new feature nor would have significant performance improvements, but I wanted to go through the algorithm on my own step by step. So basically I started reimplenting it. But why reinvent the wheel?

As developers we’re used to the common folklore law of *You-simply-should-not-reinvent-the-wheel*. This is
absolutely valid when it comes to the question what to use in production environments (think of
cryptography!). But if you’re interested in extending your toolset of concepts the answer is different: Rebuilding the
algorithm on your own - independent of its difficulty - will make you learn something more than *Copy&Paste*. It’s the
same phenomenon that transcribing something from the blackboard in school by yourself will pay off way more than simply photocopying your neighbor’s notes 😉 .

By the way, there is no specific advantage in choosing Ruby here. You’re fine to use any turing complete programming language that you like and do the same. Of course there already are plenty of Ruby variants of the Levenshtein algorithm, for example I found a very minimalistic implementation here:

Its pretty minified² but therefore hard to read when you want to learn how to
apply the algorithm by youself. I felt the drive for giving it more structure (and therefore
bloating it intentionally). The first object I focused on was
the so-called *recurrence matrix*, usually called *D* symbolized by the `costs`

Array in the
minimalistic version. This is the table we filled in the example above. For me it felt like an own container data type
(in the fashion of a monad), a somewhat supercharged Array. So I defined an own class for it:

The crucial point here is that the `RecurrenceMatrix`

automatically
fills the entries that are initially known. This container class may look
overengineered here, but note that one can explicitly say

which I really like because it says without any comment what it is, what its purpose
is and how to use it (“Matrix” in the class name implictly gives the hint to the
user to make use of the operator `:[]`

, at least this is what I would expect).

Then I thought about how the outer API of my fictional class `LevenshteinDistance`

could look like. The Levenshtein distance
is a natural number³, so what I had in mind is

```
result = LevenshteinDistance.new('cat', 'cute').to_i
```

And the `to_i`

method glues the successive computations together
and returns the last value computed. Here, `d`

is a recurrence matrix
as reasoned above.

The only item missing is the computation of the costs for each matrix entry. Here’s what I did based on the mathematical formulation:

One could take this one step more consequent and move all the computation logic inside the `RecurrenceMatrix`

.
But alltogether the state above felt fine to me and made me ready for switching to use a more performant implementation relying on C instead :)

Note that my implementation works well for strings that aren’t long, say below 500 characters. During my heuristically flavoured
(and therefore non-scientific) benchmark on my local machine it took 1.28 seconds to compute the
Levenshtein distance of strings of length 1000 and ~30 seconds to compute the Levenshtein distance of
strings of length 10000. Theoretically the algorithm itself is of complexity *O(mn)* where *m* and *n* are the sizes
of the input strings. So in conclusion, to compare really large strings you may want to use a Ruby gem
that relies on native C code extensions or similar like

I hope you feel encouraged now playing with algorithms or other concepts from academia that are unknown to you!

- source code of the implementation discussed here
- Levenshtein algorithm at wikipedia
- A very interesting blogpost that takes things one step further and includes thoughts about optimizations and the Levenshtein algorithm seen from an autmata theory perspective (including a Python implementation)
- A presentation that summarizes more related variants and optimization ideas

¹) I implictly defined the *cost* of each operation as +1 as it is “per default”. This can be modified
of course, but this is not in the scope of this write-up. Just keep in mind
that this is a feature that can be adjusted.

²) … which is totally okay because it was intended as a short concise implementation there!

³) It’s completely off the topic, but an interesting philosophical question indeed: The discussion if zero is treated as a natural number or not 😉