Distance Between Strings
It's a nice feature of the latest word processor programs that they are capable of suggesting a replacement for a mistyped word. Spelling checkers know how to evaluated distance between a misspelled word and the words in its files. Words whose evaluated distance is the smallest are offered as candidates for replacement. The applet below helps you acquaint yourself with two possible distances between strings. These metric functions attempt to ascribe a numeric (actually, integer) value to the degree of dissimilarity between two strings. That the functions do indeed satisfy the metric axioms can be shown by induction. (Which is a good exercise too.)
Hamming Distance
The Hamming distance H is defined only for strings of the same length. For two strings s and t,
Levenshtein Distance
The Levenshtein (or edit) distance is more sophisticated. It's defined for strings of arbitrary length. It counts the differences between two strings, where we would count a difference not only when strings have different characters but also when one has a character whereas the other does not. The formal definition follows.
For a string s, let s(i) stand for its ith character. For two characters a and b, define
r(a, b) = 0 if a = b. Let r(a, b) = 1, otherwise. |
Assume we are given two strings s and t of length n and m, respectively. We are going to fill an (n+1)×(m+1) array d with integers such that the low right corner element d(n+1, m+1) will furnish the required values of the Levenshtein distance
The definition of entries of d is recursive. First set
(1) | d(i, j) = min(d(i-1, j)+1, d(i, j-1)+1, d(i-1, j-1) + r(s(i), t(j))) |
|
(Type your strings in the two edit controls at the bottom of the applet. Click "Do It!".)
You can experiment with the applet to verify the triangle inequality before trying to prove it. You may try to discover other features of functions H and L. If strings s and t have the same length then both

|Contact| |Front page| |Contents| |Up| |Algebra|
Copyright © 1996-2018 Alexander BogomolnyAssume two strings s and t have the same length. Is it true that

|Contact| |Front page| |Contents| |Up| |Algebra|
Copyright © 1996-2018 Alexander BogomolnyBoth Hamming and Levenshtein functions are metrics. The only portion of the definitions that may appear non-trivial is the triangle inequality. We show it by induction.
What do functions H and L stand for?
H(s, t)≤ H(s, r) + H(r, t) |
just asserts that changing s to t "via" r could not be worse than changing s into t directly.
L(s, t)≤ L(s, r) + L(r, t). |
Now, why is L(s, t) is the minimum number of edit operations that transform s into t. Let's look more carefully at the definition:
(1) | d(i, j) = min(d(i-1, j)+1, d(i, j-1)+1, d(i-1, j-1) + r(s(i), t(j))) |
To estimate the effort needed to transform s(0, i) - the first i characters of s - into t(0, j) - the first j characters of t - consider the three terms in (1), which correspond to the following situations:
Delete s(i) from s(0, i) and use d(i-1, j) operations to edit
s(0, i-1) intot(0, j) .Use
d(i, j-1) operations to edit s(0, i) into t(0, j-1) and then append t(j).Use
d(i-1, j-1) operations to edit s(0, i-1) into t(0, j-1) and then replace s(i) with t(j), if necessary.
If the three terms on the right in (1) are optimal, then the term on the left is also optimal, for, on the last step of editing, one can't do better than making a single change or not making any change at all.

|Contact| |Front page| |Contents| |Up| |Algebra|
Copyright © 1996-2018 Alexander Bogomolny70571002