A data parallel strategy for aligning multiple biological sequences on multi-core computers

This paper addresses large-scale biological sequence alignment, a problem in growing demand in computational biology. We adopt the data parallelism paradigm, which suits large-scale processing on multi-core computers, to achieve a high degree of parallelism. Building on this paradigm, we propose a general strategy that can speed up any multiple sequence alignment method. We applied five different clustering algorithms within the strategy and ran rigorous tests on an 8-core computer using four traditional benchmarks and artificially generated sequences. The results show that our multi-core implementations achieve up to a 151-fold improvement in execution time while losing 2.19% accuracy on average. The source code of the proposed strategy, together with the test sets used in our analysis, is available on request.
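The abstract gives only the outline of the strategy: cluster the input sequences, then align the clusters concurrently on the available cores. The Python sketch below illustrates that general idea and is not the authors' implementation. The k-mer Jaccard clustering (`greedy_cluster`), the gap-padding aligner (`pad_align`), and all parameter names are hypothetical stand-ins. The abstract does not say how per-cluster alignments are combined, so the sketch returns them separately.

```python
from concurrent.futures import ThreadPoolExecutor


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Hypothetical stand-in for a clustering algorithm: put each sequence
    into the first cluster whose representative shares enough k-mers."""
    clusters = []  # list of (representative k-mer set, member sequences)
    for s in seqs:
        ks = kmer_set(s, k)
        for rep, members in clusters:
            if jaccard(rep, ks) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((ks, [s]))
    return [members for _, members in clusters]


def pad_align(cluster):
    """Placeholder aligner (assumption): right-pad with gaps to equal length.
    Any real multiple sequence alignment method would be plugged in here."""
    width = max(len(s) for s in cluster)
    return [s.ljust(width, "-") for s in cluster]


def align_clusters(seqs, align_fn=pad_align, workers=8, threshold=0.5):
    """Cluster the sequences, then align every cluster concurrently."""
    clusters = greedy_cluster(seqs, threshold)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves cluster order in the returned list.
        return list(pool.map(align_fn, clusters))


if __name__ == "__main__":
    demo = ["ACGTACGT", "ACGTACG", "TTTTGGGG", "TTTTGGG"]
    for block in align_clusters(demo):
        print(block)
```

A thread pool keeps the sketch easy to run anywhere. For a CPU-bound aligner written in pure Python, a process pool (or invoking an external alignment tool per cluster) would be needed to actually occupy several cores.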

Kenli Li | Xiangyuan Zhu | Ahmad Salah
