2.5 Minimum Edit Distance

Edit distance measures similarity between strings.

DEF given a set of string editing operations, minimum edit distance between two strings is defined as the minimum number of operations needed to transform one string into another.

  • Common operations include insertion(insert a character), deletion(delete a character), substitution(replace a character with another).

  • DEF Minimum edit distance can be represented by an alignment. This is an alignment between INTENTION and EXECUTION:

    I N T E * N T I O N
    | | | | | | | | | |
    * E X E C U T I O N
    • Every character is mapped either to a character in the other string, or a null character.

    • Mappings may not cross.

  • Operations can be assigned costs. Levenshtein distance ( ref: [1] ) assumes every operation has a cost of 1 except the substitution of a letter for itself(E-E in the case above), which has a zero cost.

  • ALGO Minimum edit distance problem can be solved through dynamic programming. DP equation:

    • D[i,j] = min(
          D[i-1,j]+delete_cost(source[i]),
          D[i,j-1]+insert_cost(target[j]),
          D[i-1,j-1]+substitute_cost(source[i],target[j])
      )
    • The algorithm gives minimum cost of transferring string source into string target. D[i,j] represents the optimal cost of transferring an i-character long prefix of source into a j-character-long prefix of target.

    • substitute_cost is zero when source[i] == target[j].

  • Viterbi algorithm is a probabilistic extension of minimum edit distance.

Reference

[1] Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10(8), 707–710. Original in Doklady Akademii Nauk SSSR 163(4): 845–848 (1965).

Last updated

Was this helpful?