Sequence Alignment¶
Edit Distance¶
The Needleman-Wunsch score defines the edit distance as the sum of 'cost' of mismatches and the 'cost' of gaps.
Motivation¶
Edit distance is useful for finding the difference between two strings. This is helpful in using diff
operations, autocorrect, speech recognition, and computational biology.
Problem Definition¶
We are given two strings, \(X = [x_i]\) and \(Y = [y_i]\) of lengths \(m\) and \(n\) respectively. The goal is to find the insertion of gaps so as to achieve the minimum edit distance for these two strings.
\(\Omega(A, B)\) = optimal solution for strings \(A\) and \(B\).
Subproblem Definition¶
Let's look at the end of these strings. We have three possibilities. One is that they're both perfectly aligned, the other two involve one of them ending in a gap. Let us define the ideal strings (gaps included) as \(X'\) and \(Y'\), of lenths \(m'\) and \(n'\) respectively. We may end up with three cases:
- \(m = n\) => \(\Omega(X,Y) = f(\Omega(X-x_m, Y-y_n))\)
- \(m > n\) => \(\Omega(X,Y) = f(\Omega(X-x_m, Y))\)
- \(m < n\) => \(\Omega(X,Y) = f(\Omega(X, Y-y_n))\)
Proof of Optimal Substructure¶
Claim: \(\Omega(X,Y)\) with \(x_m, y_n\) removed = \(\Omega(X-x_m, Y-y_n)\).
If there is any other string \(\overline{X}\) , \(\overline{Y}\) that gives \(\Omega(\overline{X},\overline{Y}) > \Omega(X-x_m, Y-y_m)\), that would imply that \(\Omega(\overline{X}, \overline{Y}) + \alpha (x_m, y_m)\) > \(\Omega(X, Y)\), which would violate the optimality of the solution. Therefore, no such string is possible.
Recurrence Relation¶
Define \(A_i\) as the string \(A\) up till \(i\) (included)
Base Case:
Complexity¶
This gives us time and space complexities of \(\Theta(mn)\) , when we memoise the DP table.
Pseudo-Code¶
Define P[m+1, n+1]
Align (X, Y):
for each i: P[i, 0] = i.alpha(gap)
for each j: P[0, j] = j.alpha(gap)
for i from 1 to m:
for j from 1 to n:
P[i, j] = min(P[i-1, j-1] + alpha(x[i], y[j]),
P[i, j-1] + alpha(gap)
P[i-1, j] + alpha(gap)
)
return P[m, n]
Space Optimisation¶
Each row only needs the values in the current row, and the row above it to fill (provided we are filling the table in row major). In this scenario, we choose to only store the current row and the row above it. This improves our time complexity to \(O(min(m, n))\).