Dynamic Programming:

The Edit Distance Problem


CS 2: Introduction to Programming Methods
9 February 2004
The Edit Distance problem
• Problem: what is the cheapest way to transform one word (the source) into another word (the output)?
• Example: transform “algorithm” into “alligator”.
• Initially, you start at the first character of the source and have an empty output.
• At any point, you can:
  • Delete the current character of the source.
  • Insert a new character into the output word.
  • Copy the current character of the source into the output.
• Copying and deleting move you to the next character.

http://www.cs.caltech.edu/courses/cs2/ 9 February 2004

Example: “algorithm” → “alligator”

Source: a l g o r i t h m

Step  Operation   Output so far
 0    (none)      (empty)
 1    Copy        a
 2    Copy        a l
 3    Insert(l)   a l l
 4    Insert(i)   a l l i
 5    Copy        a l l i g
 6    Delete      a l l i g
 7    Delete      a l l i g
 8    Delete      a l l i g
 9    Insert(a)   a l l i g a
10    Copy        a l l i g a t
11    Delete      a l l i g a t
12    Delete      a l l i g a t
13    Insert(o)   a l l i g a t o
14    Insert(r)   a l l i g a t o r

Operations: Copy, Copy, Insert(l), Insert(i), Copy, Delete, Delete, Delete, Insert(a), Copy, Delete, Delete, Insert(o), Insert(r)

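The walkthrough can be replayed mechanically. The following is a minimal sketch, assuming operations are written as the strings "Copy", "Delete", and "Insert(c)"; the function name `apply_operations` is hypothetical and not part of the course material.

```python
def apply_operations(source, operations):
    """Replay Copy / Delete / Insert(c) operations against `source`."""
    output = []
    pos = 0  # index of the current character of the source
    for op in operations:
        if op == "Copy":
            output.append(source[pos])  # copy current character...
            pos += 1                    # ...and move to the next one
        elif op == "Delete":
            pos += 1                    # skip the current character
        elif op.startswith("Insert(") and op.endswith(")"):
            output.append(op[len("Insert("):-1])  # source position unchanged
        else:
            raise ValueError(f"unknown operation: {op}")
    if pos != len(source):
        raise ValueError("operations did not consume the whole source")
    return "".join(output)


OPS = ["Copy", "Copy", "Insert(l)", "Insert(i)", "Copy",
       "Delete", "Delete", "Delete", "Insert(a)", "Copy",
       "Delete", "Delete", "Insert(o)", "Insert(r)"]

print(apply_operations("algorithm", OPS))  # prints "alligator"
```

Only Copy and Delete advance the source position, which matches the rule that copying and deleting move you to the next character.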
The Edit Distance problem
• Problem: what is the cheapest way to transform one word (the source) into another word (the output)?
• At any point, you can:
  • Delete the current character of the source.
  • Insert a new character into the output word.
  • Copy the current character of the source into the output.
• Each operation has a cost associated with it.
• The cost of a transformation is the sum of the costs of each operation in the sequence.

Example: “algorithm” → “alligator”
• Operations: Copy, Copy, Insert(l), Insert(i), Copy, Delete, Delete, Delete, Insert(a), Copy, Delete, Delete, Insert(o), Insert(r)
• Assume that:
  • Copying costs 3
  • Inserting costs 5
  • Deleting costs 2
• The cost of the above transformation is: 47
  • This is just 3 + 3 + 5 + 5 + 3 + 2 + 2 + ...
  • Equivalently: 4 copies, 5 inserts, and 5 deletes give 4·3 + 5·5 + 5·2 = 47.

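The sum can be checked with a few lines of code. This is an illustrative sketch only; the names `COSTS` and `transformation_cost` are assumptions, and the costs are the ones assumed on the slide.

```python
COSTS = {"Copy": 3, "Insert": 5, "Delete": 2}  # costs assumed on the slide

OPS = ["Copy", "Copy", "Insert(l)", "Insert(i)", "Copy",
       "Delete", "Delete", "Delete", "Insert(a)", "Copy",
       "Delete", "Delete", "Insert(o)", "Insert(r)"]


def transformation_cost(operations, costs=COSTS):
    """Sum the cost of every operation in the sequence."""
    # "Insert(l)" -> "Insert"; "Copy" and "Delete" are left unchanged.
    return sum(costs[op.split("(")[0]] for op in operations)


print(transformation_cost(OPS))  # 47
```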
The Edit Distance problem
• Problem: what is the cheapest way to transform one word (the source) into another word (the output)?
• At any point, you can:
  • Delete the current character of the source.
  • Insert a new character into the output word.
  • Copy the current character of the source into the output.
• By “cheapest”, we mean the transformation with the least cost.

How to solve this problem?
• Will trying out all possible transformations work?
• Answer: It can, but it would take far too long: the number of possible operation sequences grows exponentially with the lengths of the words.

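To make "trying out all possible transformations" concrete, here is a hedged sketch of an exhaustive search. The function name and its default costs are assumptions chosen for illustration; it explores every valid sequence, so it is only usable on very short words.

```python
def cheapest_brute_force(x, y, copy=3, insert=5, delete=2):
    """Cheapest cost to transform x into y, exploring every operation sequence."""
    if not x and not y:
        return 0  # nothing left to consume or produce
    options = []
    if x and y and x[0] == y[0]:
        # Copy the current source character into the output.
        options.append(copy + cheapest_brute_force(x[1:], y[1:], copy, insert, delete))
    if x:
        # Delete the current source character.
        options.append(delete + cheapest_brute_force(x[1:], y, copy, insert, delete))
    if y:
        # Insert the next output character.
        options.append(insert + cheapest_brute_force(x, y[1:], copy, insert, delete))
    return min(options)


print(cheapest_brute_force("algori", "al", copy=5, insert=10, delete=10))  # 50
```

Each call branches into as many as three further calls and recomputes the same suffix pairs many times, which is the waste that dynamic programming avoids.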
Will dynamic programming work?
• Depends on our problem.
• Can our problem be solved based on the solution of some subproblems?
• What are the subproblems, if there are any?
• As is typical when trying out dynamic programming, it helps to find some “key observation” or insight.
• We will need three observations for this problem.

Notation
• (s1, s2, s3, ..., sn) stands for a sequence of n elements.
• Used for a list of operations.
  • Example: (Copy, Copy, Insert(c), Delete).
• Also used for strings.
  • Example: “help” would correspond to (h, e, l, p).

Key Observation #1
• Given: Strings X = (x1, ..., xn) and Y = (y1, ..., yt), and a sequence S of operations (s1, s2, ..., sm).
• Let S’ = (s1, ..., s(m-1)), X’ = (x1, ..., x(n-1)), and Y’ = (y1, ..., y(t-1)).
• Suppose that S is the cheapest sequence of operations to transform X into Y, and that sm is a Copy operation.

Key Observation #1 (cont.)
• Claim: then, S’ is the cheapest sequence of operations to transform X’ to Y’.
• Proof: by contradiction. Suppose that T = (t1, ..., tk) is cheaper than S’ and transforms X’ to Y’.
• Then (t1, ..., tk, sm) is cheaper than S and transforms X to Y. This is a contradiction, as S was supposed to be the cheapest way to transform X to Y.

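The step "is cheaper than S" can be spelled out as a chain of (in)equalities. The notation cost(·) for the total cost of a sequence is an assumption introduced here; it relies only on the earlier definition that the cost of a transformation is the sum of the costs of its operations.

```latex
\[
\operatorname{cost}(t_1, \ldots, t_k, s_m)
  = \operatorname{cost}(T) + \operatorname{cost}(s_m)
  < \operatorname{cost}(S') + \operatorname{cost}(s_m)
  = \operatorname{cost}(S).
\]
```

The same chain applies unchanged to the proofs of the Delete and Insert cases.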
Key Observation #2
• Given: Strings X = (x1, ..., xn) and Y = (y1, ..., yt), and a sequence S of operations (s1, s2, ..., sm).
• Let S’ = (s1, ..., s(m-1)) and X’ = (x1, ..., x(n-1)).
• Suppose that S is the cheapest sequence of operations to transform X into Y, and that sm is a Delete operation.

Key Observation #2 (cont.)
• Claim: then, S’ is the cheapest sequence of operations to transform X’ to Y.
• Proof: by contradiction. Suppose that T = (t1, ..., tk) is cheaper than S’ and transforms X’ to Y.
• Then (t1, ..., tk, sm) is cheaper than S and transforms X to Y. This is a contradiction, as S was supposed to be the cheapest way to transform X to Y.

Key Observation #3
• Given: Strings X = (x1, ..., xn) and Y = (y1, ..., yt), and a sequence S of operations (s1, s2, ..., sm).
• Let S’ = (s1, ..., s(m-1)) and Y’ = (y1, ..., y(t-1)).
• Suppose that S is the cheapest sequence of operations to transform X into Y, and that sm is an Insert(yt) operation.

Key Observation #3 (cont.)
• Claim: then, S’ is the cheapest sequence of operations to transform X to Y’.
• Proof: by contradiction. Suppose that T = (t1, ..., tk) is cheaper than S’ and transforms X to Y’.
• Then (t1, ..., tk, sm) is cheaper than S and transforms X to Y. This is a contradiction, as S was supposed to be the cheapest way to transform X to Y.

An observation about the observations...
• Given X, Y, and S, we have covered all possible cases for sm (the last operation): it is a Copy, a Delete, or an Insert.

Putting the observations to use
• The best sequence of operations to transform X to Y depends on one of the following:
  • (1) The best way to transform X’ to Y’.
  • (2) The best way to transform X to Y’.
  • (3) The best way to transform X’ to Y.
• The three cases become the subproblems to consider.
  • Given the solution to all three, we can find the solution to our actual problem (transforming X to Y).
  • Why? The best solution to transforming X to Y must contain a solution to one of the three cases by the three observations.

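The three cases can be summarized as a recurrence. The symbols below are assumptions introduced for this sketch: D(i, j) denotes the cost of the cheapest way to transform (x1, ..., xi) into (y1, ..., yj), and c_Copy, c_Insert, c_Delete denote the operation costs.

```latex
\[
D(0,0) = 0, \qquad
D(i,j) = \min
\begin{cases}
D(i-1,\,j-1) + c_{\mathrm{Copy}}   & \text{if } i \ge 1,\ j \ge 1,\ x_i = y_j \quad \text{(case 1)}\\
D(i,\,j-1) + c_{\mathrm{Insert}}   & \text{if } j \ge 1 \quad \text{(case 2)}\\
D(i-1,\,j) + c_{\mathrm{Delete}}   & \text{if } i \ge 1 \quad \text{(case 3)}
\end{cases}
\]
```

The answer to the original problem is D(n, t), and each value depends only on values with smaller indices.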
How to organize the subproblems
• Each subproblem is characterized by some initial part of the original strings X and Y.
• So use a matrix, with one column for each initial part of X and one row for each initial part of Y (the first row and the first column stand for the empty string “”).

Y \ X   “”   a    l    g    o    r    i
“”      .    .    .    .    .    .    .
a       .    .    .    .    .    .    .
l       .    .    .    .    .    .    .

How to organize the subproblems (cont.)
• The location in row a, column o contains the score for the cheapest way to transform “algo” to “a”.
• The location in row “”, column g contains the score for the cheapest way to transform “alg” to “”.
• The location in row a, column “” contains the score for the cheapest way to transform “” to “a”.
• The location in row “”, column “” contains the score for the cheapest way to transform “” to “”.
  • This is the smallest problem possible.

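One possible way to lay out such a matrix in code is sketched below. The helper names `make_table` and `subproblem` are hypothetical; the layout (rows for initial parts of Y, columns for initial parts of X) follows the slides.

```python
def make_table(x, y):
    """Allocate one row per initial part of y and one column per initial part of x."""
    # The extra row and column stand for the empty string.
    return [[None] * (len(x) + 1) for _ in range(len(y) + 1)]


def subproblem(x, y, row, col):
    """Return the (source part, output part) that table[row][col] stands for."""
    return x[:col], y[:row]


table = make_table("algori", "al")
print(len(table), len(table[0]))          # 3 7
print(subproblem("algori", "al", 1, 4))   # ('algo', 'a')
print(subproblem("algori", "al", 0, 0))   # ('', '')
```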
Additional information to store
• Each location should store the last operation in the cheapest sequence of operations used to get there.
• Example: a location whose row and column letters match (such as row a, column a) would contain a score, and possibly a Copy operation.

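A location therefore holds two pieces of data. A small sketch of one way to represent that follows; the `Cell` type and its field names are assumptions for illustration, not something defined in the slides.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    score: int           # cost of the cheapest sequence that reaches this location
    op: Optional[str]    # last operation of that sequence; None plays the role of "null"


start = Cell(score=0, op=None)       # transforming "" to "" costs nothing
copied = Cell(score=5, op="Copy")    # example values only

print(start)
print(copied.score, copied.op)
```

Because a NamedTuple is still a tuple, such cells can also be compared or unpacked as plain (score, op) pairs.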
How to fill out the matrix
• For concreteness, assume that Copy costs 5, Insert(c) costs 10, and Delete costs 10.

How to fill out the matrix (cont.)
• Start at the upper left.
  • This is trivial: score is zero, operation is null.
• All other locations depend on the values in up to three other places.
  • Diagonal movement corresponds to a Copy operation.
  • Vertical → Insert(c).
  • Horizontal → Delete.

Y \ X   “”         a          l          g          o          r          i
“”      0 null
a
l

How to fill out the matrix (cont.)
• Fill out each row in turn.
• Only one option for the first row: each entry is reached by a Delete from the entry to its left.

Y \ X   “”         a          l          g          o          r          i
“”      0 null     10 Del     20 Del     30 Del     40 Del     50 Del     60 Del
a
l

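The first row can be produced with a simple loop. This is a sketch under the slide's assumption that Delete costs 10; the function name `first_row` and the (score, operation) pair layout are illustrative choices.

```python
def first_row(x, delete_cost=10):
    """Entries for transforming each initial part of x into the empty string."""
    row = [(0, None)]  # upper-left corner: score zero, operation null
    for _ in x:
        prev_score, _prev_op = row[-1]
        # The only way into a first-row location is a Delete from the left.
        row.append((prev_score + delete_cost, "Del"))
    return row


print(first_row("algori"))
```

For "algori" the scores come out as 0, 10, 20, 30, 40, 50, 60, matching the first row of the matrix.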
How to fill out the matrix (cont.)
• Fill out each row in turn.
• The first entry of each later row also has only one option: an Insert from the entry above it.

Y \ X   “”         a          l          g          o          r          i
“”      0 null     10 Del     20 Del     30 Del     40 Del     50 Del     60 Del
a       10 Ins(a)
l

http://www.cs.caltech.edu/courses/cs2/ 9 Feburary 2004
How to fill out the matrix (cont.)
 Three possibilities to consider for the marked square.
 Clear that Copy is cheapest. (5 versus 20 for Insert(a)
or Delete).
X
a l g o r i

0 10 20 30 40 50 60
null Del Del Del Del Del Del
a 10
Y Ins(a)
l

CS 2: Introduction to Programming Methods The Edit Distance Problem 63


http://www.cs.caltech.edu/courses/cs2/ 9 Feburary 2004
How to fill out the matrix (cont.)
 Three possibilities to consider for the marked square.
 Clear that Copy is cheapest. (5 versus 20 for Insert(a)
or Delete).
X
a l g o r i

0 10 20 30 40 50 60
null Del Del Del Del Del Del
a 10 5
Y Ins(a) Cop
l

CS 2: Introduction to Programming Methods The Edit Distance Problem 64


http://www.cs.caltech.edu/courses/cs2/ 9 Feburary 2004
How to fill out the matrix (cont.)
• Two possibilities to consider for the next square (row a, column l).
• Copy is not possible since “a” != “l”.
• If we chose Insert(a), the score would be 30.
• But Delete is cheaper! It gives a score of 15.

Y \ X   “”         a          l          g          o          r          i
“”      0 null     10 Del     20 Del     30 Del     40 Del     50 Del     60 Del
a       10 Ins(a)  5 Cop      15 Del
l

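The choice made for a single square can be sketched as follows. The function `candidates` and its parameter names are hypothetical; the costs default to the values assumed on the slides (Copy 5, Insert 10, Delete 10).

```python
def candidates(diag, up, left, x_char, y_char, copy=5, insert=10, delete=10):
    """List the (score, operation) options for one location.

    diag, up and left are the scores already stored in the neighbouring
    locations (None when that neighbour does not exist).
    """
    options = []
    if diag is not None and x_char == y_char:
        options.append((diag + copy, "Copy"))              # diagonal movement
    if up is not None:
        options.append((up + insert, f"Insert({y_char})"))  # vertical movement
    if left is not None:
        options.append((left + delete, "Delete"))           # horizontal movement
    return options


# Row a, column l of the example: diag=10, up=20, left=5.
opts = candidates(diag=10, up=20, left=5, x_char="l", y_char="a")
print(opts)       # [(30, 'Insert(a)'), (15, 'Delete')]
print(min(opts))  # (15, 'Delete')
```

For row a, column a (diag=0, up=10, left=10) the same function offers Copy at 5 and both Insert(a) and Delete at 20.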
How to fill out the matrix (cont.)
• Fill the rest of the matrix out in a similar fashion.

Y \ X   “”         a          l          g          o          r          i
“”      0 null     10 Del     20 Del     30 Del     40 Del     50 Del     60 Del
a       10 Ins(a)  5 Cop      15 Del     25 Del     35 Del     45 Del     55 Del
l       20 Ins(l)  15 Ins(l)  10 Cop     20 Del     30 Del     40 Del     50 Del

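Filling the whole matrix row by row might look like the sketch below. The function name `fill_table`, the (score, operation) pair layout, and the tie-breaking order are assumptions; the costs default to those used on the slides.

```python
def fill_table(x, y, copy=5, insert=10, delete=10):
    """Return table[row][col] = (score, op) for transforming x[:col] into y[:row]."""
    rows, cols = len(y) + 1, len(x) + 1
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, None)  # upper left: score zero, operation null
    for r in range(rows):            # fill out each row in turn
        for c in range(cols):
            if r == 0 and c == 0:
                continue
            options = []
            if r > 0 and c > 0 and x[c - 1] == y[r - 1]:
                options.append((table[r - 1][c - 1][0] + copy, "Cop"))
            if c > 0:
                options.append((table[r][c - 1][0] + delete, "Del"))
            if r > 0:
                options.append((table[r - 1][c][0] + insert, f"Ins({y[r - 1]})"))
            # Compare by score only; on a tie the earliest option is kept.
            table[r][c] = min(options, key=lambda option: option[0])
    return table


for row in fill_table("algori", "al"):
    print(row)
```

For “algori” and “al” this reproduces the scores and operations of the matrix above, ending with (50, 'Del') in the bottom-right corner.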
Transforming “algori” to “al”
• The cheapest sequence of operations has cost: 50 (the entry in the bottom-right corner of the matrix).

Transforming “algori” to “al” (cont.)
• Recover the sequence by working backwards from the bottom-right corner.
• We get: Copy, Copy, Delete, Delete, Delete, Delete.

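Working backwards can be sketched in code as well. The table `OPS` below is transcribed from the operations shown in the matrix for “algori” and “al”; the function name `recover` is hypothetical.

```python
# Operations stored in the matrix (rows: "", a, l; columns: "", a, l, g, o, r, i).
OPS = [
    [None,     "Del",    "Del", "Del", "Del", "Del", "Del"],
    ["Ins(a)", "Cop",    "Del", "Del", "Del", "Del", "Del"],
    ["Ins(l)", "Ins(l)", "Cop", "Del", "Del", "Del", "Del"],
]


def recover(ops):
    """Walk from the bottom-right corner back to the upper left."""
    r, c = len(ops) - 1, len(ops[0]) - 1
    sequence = []
    while ops[r][c] is not None:
        op = ops[r][c]
        sequence.append(op)
        if op == "Cop":
            r, c = r - 1, c - 1   # Copy came from the diagonal neighbour
        elif op == "Del":
            c -= 1                # Delete came from the left neighbour
        else:
            r -= 1                # Insert came from the neighbour above
    sequence.reverse()            # operations were collected last-to-first
    return sequence


print(recover(OPS))  # ['Cop', 'Cop', 'Del', 'Del', 'Del', 'Del']
```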
Why this worked
• This worked because of the three observations we made earlier.
• The best answer to put in a location must use the best solution to one of three possible subproblems.
• So we solved those subproblems first.
• Then we considered cases and figured out how best to solve our current problem.
• We could recover the cheapest sequence of operations since we stored operations at each step.
• Each location’s operation tells us where to look for the previous one in the sequence.
