A data parallel strategy for aligning multiple biological sequences on multi-core computers

In this paper, we address the large-scale biological sequence alignment problem, for which demand is increasing in computational biology. We employ the data parallelism paradigm, which is well suited to large-scale processing on multi-core computers, to achieve a high degree of parallelism. Using this paradigm, we propose a general strategy that can be used to speed up any multiple sequence alignment method. We applied five different clustering algorithms in our strategy and ran rigorous tests on an 8-core computer using four traditional benchmarks and artificially generated sequences. The results show that our multi-core implementations achieve up to a 151-fold improvement in execution time while losing 2.19% accuracy on average. The source code of the proposed strategy, together with the test sets used in our analysis, is available on request.
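The abstract does not spell out the pipeline, so the following is only a minimal sketch of one way a clustering-based, data-parallel alignment strategy could be organized: partition the input sequences into clusters, then align each cluster independently on its own core. Every name here is hypothetical. `greedy_cluster` is a simple k-mer similarity stand-in and is not one of the five clustering algorithms evaluated in the paper, and `pad_align` is a placeholder for a real multiple sequence alignment method. The step that would merge per-cluster alignments into one final alignment is not sketched.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Hypothetical stand-in clustering (an assumption, not the paper's method).

    Each sequence joins the first cluster whose representative has a
    Jaccard k-mer similarity >= threshold; otherwise it opens a new cluster.
    """
    clusters = []  # list of (representative k-mer set, member sequences)
    for s in seqs:
        ks = kmer_set(s, k)
        for rep, members in clusters:
            union = rep | ks
            if union and len(rep & ks) / len(union) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((ks, [s]))
    return [members for _, members in clusters]


def pad_align(cluster):
    """Placeholder aligner (assumption): right-pad with gaps to equal length.

    A real run would call an actual multiple sequence alignment method here.
    """
    width = max(len(s) for s in cluster)
    return [s.ljust(width, "-") for s in cluster]


def align_clusters(clusters, align=pad_align, workers=8, use_processes=True):
    """Data-parallel step: apply the same aligner to every cluster.

    Processes give true multi-core parallelism in CPython; threads are
    offered only as a lightweight option for quick checks.
    """
    factory = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with factory(max_workers=workers) as pool:
        return list(pool.map(align, clusters))
```

Because the clusters are aligned independently, the aligner passed as `align` can be swapped without touching the parallel driver, which is the sense in which such a strategy can wrap any alignment method. When using processes, `align` must be a picklable top-level function, and on platforms that spawn workers the call should sit under an `if __name__ == "__main__":` guard.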
