A novel approach to multiple sequence alignment using Hadoop data grids

Multiple alignment of protein sequences is an essential tool in molecular biology. It helps determine evolutionary relationships and predict molecular structures. The two main factors to consider when aligning multiple sequences are the speed and the accuracy of the alignment. Dynamic programming algorithms such as Needleman-Wunsch and Smith-Waterman produce accurate alignments, but they are computationally intensive and are therefore limited to a small number of short sequences. In this paper we propose a time-efficient approach to sequence alignment that produces high-quality alignments. The dynamic nature of the algorithm, coupled with the data and computational parallelism of Hadoop data grids, improves both the accuracy and the speed of sequence alignment. Furthermore, owing to the scalability of the Hadoop framework, the proposed multiple sequence alignment method is also well suited to large-scale alignment problems.
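To illustrate why dynamic programming alignment is computationally intensive, the following is a minimal sketch of Needleman-Wunsch global alignment scoring for a single pair of sequences. It is not the method proposed in this paper. The function name and the scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for illustration; protein alignment in practice typically uses a substitution matrix instead.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    # Row 0: aligning the empty prefix of a against each prefix of b
    # costs one gap per character.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Align a[i-1] with a gap, or b[j-1] with a gap.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATTACA"))
    print(needleman_wunsch_score("AC", "AGC"))
```

Every cell of the table is visited once, so a single pair of sequences of lengths m and n costs on the order of m × n operations. Repeating this over many sequence pairs is the workload that motivates distributing the computation.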
