Update (May 26th 2014): See the RESTful web service implementation of this algorithm here

Given two DNA sequences, you are asked to align these two sequences where for each non-matching nucleotide pair you will be penalized by 1 point and for each gap (space) you insert into a sequence to shift a portion of the sequence, you will be penalized by 2 points. The requirement is that you will have to do this alignment with the least amount of penalties as possible (thus minimizing the total edit distance between these two sequences).

Here is an example:

Align:

GC

C

First possible alignment:

GC

C-

edit distance: mismatch(G,C): 1 point penalty + gap (C,-): 2 point penalty = 3

Second possible alignment:

GC

-C

edit distance: gap (G,-): 2 point penalty

Since the second alignment gives us the smaller edit distance (least total penalties), the second alignment is optimal.

This problem can be solved for any pair of sequences of arbitrary sizes using a dynamic programming approach that is known as the Needleman-Wunsch algorithm published in 1970.

This algorithm is used for the *global alignment* problem in bioinformatics, where the sequences are of more or less the same size and largely similar. The gist of the algorithm is to divide the problem into smaller sub-problems (align prefixes and build the alignment of the whole sequences based on that). The advantage of dynamic programming is that it computes the solution to a sub-problem only once.

If the sequences are of significantly different sizes and/or mostly dissimilar, *local alignment* algorithms are used to identify the sub-sequences that are similar. See the

Smith-Waterman algorithm for local alignment.

These algorithms are so common in bioinformatics–they are part of many online toolkits. Likewise, many implementations in Java, Python and other languages can be found online. My suggestion is that you implement the algorithm from scratch to appreciate its elegance and to convince yourselves that it indeed works.

Here is my implementation of the Needleman-Wunsch algorithm in Java:

public class NeedlemanWunsch { public static boolean verbose=false; /* * prints out the score matrix */ public static void dumpMatrix(int[][] matrix, String row, String column){ System.out.print(String.format("%5s","")); for (int j =0; j< row.length(); j++){ System.out.print(String.format("%5s", row.charAt(j)+" ")); } System.out.println(); for (int i =0; i< column.length(); i++){ System.out.print(String.format("%5s",column.charAt(i)+ " ")); for (int j =0; j< row.length(); j++){ System.out.print(String.format("%5s", matrix[j][i]+" ")); } System.out.println(); } } /* * Needleman-Wunsch Dynamic Programming Algorithm * see http://amrita.vlab.co.in/?sub=3&brch=274&sim=1431&cnt=1 * Runtime complexity: O(NxM), * Space complexity: O(NxM) where N,M are lengths of the sequences */ public static AlignmentResult computeNWAlignment(String seq1, String seq2, SimpleAlignmentParameters parameters) { //It is easier to index and faster to use //integer matrix for fixed length storage int[][] scoreMatrix = new int[seq1.length()+1][seq2.length()+1]; int gapPenalty=parameters.getGapPenalty(); int substitutePenalty=parameters.getSubstitutePenalty(); AlignmentResult result = new AlignmentResult(); result.setParameters(parameters); //Initialize the score matrix //the first row and column are for the gap //Complexity: O(NxM) for (int i =0; i< seq1.length()+1; i++){ for (int j =0; j< seq2.length()+1; j++){ scoreMatrix[i][j]=0; if (i==0) scoreMatrix[i][j] = gapPenalty*j; else if (j==0) scoreMatrix[i][j] = gapPenalty*i; } } if (verbose){ System.out.println("Initial Matrix"); dumpMatrix(scoreMatrix, "-"+seq1, "-"+seq2); } int similarityCost=0; int matchCost=0; int seq1GapCost=0; int seq2GapCost=0; //Compute the minimum cost scores between all //possible pairs of prefixes //Complexity: O(NxM) for (int i =1; i< seq1.length()+1; i++){ for (int j =1; j< seq2.length()+1; j++){ //Case 1: The cost of mistmatch between the two prefixes similarityCost= (seq2.charAt(j-1)==seq1.charAt(i-1)) ? 0 : substitutePenalty; matchCost = scoreMatrix[i-1][j-1] + similarityCost; //Case 2: the cost of adding a gap on sequence 2 seq2GapCost = scoreMatrix[i-1][j] + gapPenalty; //Case 3: the cost of adding a gap on sequence 1 seq1GapCost = scoreMatrix[i][j-1] + gapPenalty; scoreMatrix[i][j] = Math.min(Math.min(matchCost,seq1GapCost), seq2GapCost); } } if (verbose){ System.out.println("\nFilled Matrix"); dumpMatrix(scoreMatrix, "-"+seq1, "-"+seq2); } //Reconstruct the Alignment by backtracking on //the score matrix //Complexity O(N+M) StringBuilder alignedSequence1= new StringBuilder(); StringBuilder alignedSequence2= new StringBuilder(); int j = seq2.length(); int i = seq1.length(); while (i >0 || j > 0) { if (i>0 && j > 0) similarityCost= (seq2.charAt(j-1)==seq1.charAt(i-1)) ? 0 : substitutePenalty; else similarityCost = Integer.MAX_VALUE; if ( i > 0 && j >0 && scoreMatrix[i][j] == scoreMatrix[i-1][j-1] + similarityCost) { alignedSequence1.append(seq1.charAt(i-1)); alignedSequence2.append(seq2.charAt(j-1)); i=i-1; j=j-1; } else if ( i > 0 && scoreMatrix[i][j] == scoreMatrix[i-1][j] + gapPenalty){ alignedSequence2.append("-"); alignedSequence1.append(seq1.charAt(i-1)); i=i-1; } else if ( j > 0 && scoreMatrix[i][j] == scoreMatrix[i][j-1] + gapPenalty){ alignedSequence1.append("-"); alignedSequence2.append(seq2.charAt(j-1)); j=j-1; } } // end of while result.setTotalCost(scoreMatrix[seq1.length()][seq2.length()]); result.setAlignmentLength(alignedSequence1.length()); result.setAlignments(new String[] {alignedSequence1.reverse().toString(), alignedSequence2.reverse().toString()}); return result; } }

Helper classes: SimpleAlignmentParameters and AlignmentResult:

/* * Alignment parameters used in Needleman-Wunsch Algorithm * with gap penalty */ public class SimpleAlignmentParameters { private int gapPenalty; private int substitutePenalty; public int getGapPenalty() { return gapPenalty; } public void setGapPenalty(int gapPenalty) { this.gapPenalty = gapPenalty; } public int getSubstitutePenalty() { return substitutePenalty; } public void setSubstitutePenalty(int substitutePenalty) { this.substitutePenalty = substitutePenalty; } public SimpleAlignmentParameters(int gapPenalty, int substitutePenalty) { super(); this.gapPenalty = gapPenalty; this.substitutePenalty = substitutePenalty; } public SimpleAlignmentParameters() { super(); this.gapPenalty = 2; this.substitutePenalty = 1; } }

/* * Class that encapsulates results of a binary Alignment: * parameters, edit distance and the actual alignments */ public class AlignmentResult { private int totalCost=0; private int alignmentLength=0; public int getAlignmentLength() { return alignmentLength; } public void setAlignmentLength(int alignmentLength) { this.alignmentLength = alignmentLength; } private int matches=0; private AlignmentParameters parameters=null; private String[] alignments=null; public int getMatches() { return matches; } public void setMatches(int matches) { this.matches = matches; } public int getTotalCost() { return totalCost; } public void setTotalCost(int totalCost) { this.totalCost = totalCost; } public AlignmentParameters getParameters() { return parameters; } public void setParameters(AlignmentParameters parameters) { this.parameters = parameters; } public String[] getAlignments() { return alignments; } public void setAlignments(String[] alignments) { this.alignments = alignments; } private int alignmentScore(String seq1, String seq2){ int totalCost=0; for (int k=0; k < seq1.length(); k++){ if (seq1.charAt(k)!=seq2.charAt(k)) { if ( (seq1.charAt(k)!='-') && (seq2.charAt(k)!='-') ) totalCost++; } if ( (seq1.charAt(k)=='-') || (seq2.charAt(k)=='-') ) { totalCost+=2; } } return totalCost; } }

Here is the result for a pair of DNA sequences where the gap penalty is 2 and substitution penalty is 1:

Test Alignments using Needleman-Wunsch Algorithm Original Sequences: GCGCAATG GCCCTAGCG Initial Matrix - G C G C A A T G - 0 2 4 6 8 10 12 14 16 G 2 0 0 0 0 0 0 0 0 C 4 0 0 0 0 0 0 0 0 C 6 0 0 0 0 0 0 0 0 C 8 0 0 0 0 0 0 0 0 T 10 0 0 0 0 0 0 0 0 A 12 0 0 0 0 0 0 0 0 G 14 0 0 0 0 0 0 0 0 C 16 0 0 0 0 0 0 0 0 G 18 0 0 0 0 0 0 0 0 Filled Matrix - G C G C A A T G - 0 2 4 6 8 10 12 14 16 G 2 0 2 4 6 8 10 12 14 C 4 2 0 2 4 6 8 10 12 C 6 4 2 1 2 4 6 8 10 C 8 6 4 3 1 3 5 7 9 T 10 8 6 5 3 2 4 5 7 A 12 10 8 7 5 3 2 4 6 G 14 12 10 8 7 5 4 3 4 C 16 14 12 10 8 7 6 5 4 G 18 16 14 12 10 9 8 7 5 elapsed time (sec): 0.01765 Alignments: GCGC-AATG size:9 || | | | GCCCTAGCG size:9 Match Score=5, 0.5555556 gaps:1 Edit Distance=5 Alignment Length=9

Here is how the algorithm finds the solution by backtracking the steps on the score matrix:

The algorithm constructs the alignment in reverse order, a diagonal move means match (black arrow) or a mismatch (red arrow) and a vertical or horizontal move (green arrow) means insertion of a gap either on the first (vertical) or the second sequence (horizontal). Note here that, the first sequence is placed horizontally (running from left to right) and the second sequence is placed vertically (running from top to bottom) on the score matrix.

*Question: do you think this is the only optimal alignment for this given pair?*

show answer

Here is the code that creates the result given above:

public static void testNWAlignment(String seq1, String seq2){ NeedlemanWunsch.verbose=true; System.out.println("\n\nTest Alignments using Needleman-Wunsch Algorithm"); System.out.println("Original Sequences:"); System.out.println(seq1); System.out.println(seq2); long start=System.nanoTime(); AlignmentResult result=NeedlemanWunsch.computeNWAlignment(seq1, seq2, new SimpleAlignmentParameters()); long stop=System.nanoTime(); System.out.println( "elapsed time (sec): " + String.format("%2.5f", (float)(stop-start)/Math.pow(10, 9))); String[] alignments= result.getAlignments(); System.out.println("Alignments:"); System.out.println(alignments[0] + " size:" + alignments[0].length()); int matches=0; int gaps=0; for (int k=0; k < alignments[0].length(); k++){ if (alignments[0].charAt(k)==alignments[1].charAt(k)) { matches++; System.out.print('|'); } else System.out.print(" "); if ( (alignments[0].charAt(k)=='-') || (alignments[1].charAt(k)=='-') ) gaps++; } System.out.println(); System.out.println(alignments[1] + " size:" + alignments[1].length()); System.out.println("Match Score=" + matches + ", " + ((float)matches/alignments[0].length())+ " gaps:" + gaps); System.out.println("Edit Distance="+ result.getTotalCost()); System.out.println("Alignment Length="+ result.getAlignmentLength()); }

**Test your understanding of the algorithm **:

-If all you wanted was the optimal alignment score (edit distance, smallest total penalties) but not the aligned sequences themselves, where in the score matrix would you look to retrieve this score?

-How do you prove that the optimal alignment score computed by this algorithm is indeed optimal (ie. the smallest edit distance between the two sequences) ?

-Can you calculate this optimal alignment score without using O(NxM) space for the score matrix (N,M are lengths of the sequences)? How many rows would you need on the score matrix?

**A couple of ideas to explore**:

-How do you modify the algorithm to print out all optimal alignments instead of just one?

-How easy is it to parallelize this algorithm? What would be the runtime/space complexity of such an algorithm?

-How to modify the Needleman-Wunsch algorithm to perform local alignment?

-What happens if you modify the gap and substitution penalties? What are biologically plausible ways to extend these penalties for DNA sequence matching in real applications?

-Can you improve the space complexity of the algorithm to be linear instead of quadratic? (Hint: Look up Hirschberg’s algorithm that was published in 1975)

-Can you extend this algorithm to align 3 DNA sequences at once? How about N DNA sequences?

**Resources:**

- Check out http://amrita.vlab.co.in/?sub=3&brch=274&sim=1431&cnt=1 for more information on the biology side as well as the algorithm itself

-Dynamic programming and sequence alignment : http://www.ibm.com/developerworks/opensource/library/j-seqalign/index.html