Multiobjective characteristic-based framework for very-large multiple sequence alignment

Abstract In the literature, we can find several heuristics for solving the multiple sequence alignment problem. The vast majority of them makes use of flags in order to modify certain alignment parameters; however, if no flags are used, the aligner will run with the default parameter configuration, which, often, is not the optimal one. In this work, we propose a framework that, depending on the biological characteristics of the input dataset, runs the aligner with the best parameter configuration found for another dataset that has similar biological characteristics, improving the accuracy and conservation of the obtained alignment. To train the framework, we use three well-known multiobjective evolutionary algorithms: NSGA-II, IBEA, and MOEA/D. Then, we perform a comparative study between several aligners proposed in the literature and the characteristic-based version of Kalign, MAFFT, and MUSCLE, when solving widely-used benchmarks (PREFAB v4.0 and SABmark v1.65) and very-large benchmarks with thousands of unaligned sequences (HomFam).

[1]  Dennis R. Livesay,et al.  Probalign: multiple sequence alignment using partition function posterior probabilities , 2006, Bioinform..

[2]  Lior Pachter,et al.  Fast Statistical Alignment , 2009, PLoS Comput. Biol..

[3]  Tao Jiang,et al.  On the Complexity of Multiple Sequence Alignment , 1994, J. Comput. Biol..

[4]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[5]  Chuong B. Do,et al.  ProbCons: Probabilistic consistency-based multiple sequence alignment. , 2005, Genome research.

[6]  Kalyanmoy Deb,et al.  A fast and elitist multiobjective genetic algorithm: NSGA-II , 2002, IEEE Trans. Evol. Comput..

[7]  Michael Kaufmann,et al.  DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment , 2008, Algorithms for Molecular Biology.

[8]  W. Taylor,et al.  The classification of amino acid conservation. , 1986, Journal of theoretical biology.

[9]  D. Higgins,et al.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega , 2011, Molecular systems biology.

[10]  Ruhul A. Sarker,et al.  Progressive Alignment Method Using Genetic Algorithm for Multiple Sequence Alignment , 2012, IEEE Transactions on Evolutionary Computation.

[11]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[12]  Lothar Thiele,et al.  Comparison of Multiobjective Evolutionary Algorithms: Empirical Results , 2000, Evolutionary Computation.

[13]  R. Doolittle Similar amino acid sequences: chance or common ancestry? , 1981, Science.

[14]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[15]  E. Sonnhammer,et al.  Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features , 2008, Nucleic acids research.

[16]  C. Notredame,et al.  Recent progress in multiple sequence alignment: a survey. , 2002, Pharmacogenomics.

[17]  W. A. Beyer,et al.  Some Biological Sequence Metrics , 1976 .

[18]  K. Katoh,et al.  MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. , 2002, Nucleic acids research.

[19]  R. Durbin,et al.  Pfam: A comprehensive database of protein domain families based on seed alignments , 1997, Proteins.

[20]  Qingfu Zhang,et al.  A multiobjective evolutionary algorithm based on decomposition with normal boundary intersection for traffic grooming in optical networks , 2014, Inf. Sci..

[21]  Tandy J. Warnow,et al.  PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences , 2015, J. Comput. Biol..

[22]  John P. Overington,et al.  HOMSTRAD: A database of protein structure alignments for homologous families , 1998, Protein science : a publication of the Protein Society.

[23]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[24]  Olivier Poch,et al.  BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark , 2005, Proteins.

[25]  Desmond G. Higgins,et al.  Sequence embedding for fast construction of guide trees for multiple sequence alignment , 2010, Algorithms for Molecular Biology.

[26]  Lode Wyns,et al.  SABmark- a benchmark for sequence alignment that covers the entire known fold space , 2005, Bioinform..

[27]  Miguel A. Vega-Rodríguez,et al.  A Hybrid Multiobjective Memetic Metaheuristic for Multiple Sequence Alignment , 2016, IEEE Transactions on Evolutionary Computation.

[28]  Maurits J. J. Dijkstra,et al.  Multiple Sequence Alignment. , 2017, Methods in molecular biology.

[29]  Miguel A. Vega-Rodríguez,et al.  Applying MOEAs to solve the static Routing and Wavelength Assignment problem in optical WDM networks , 2013, Eng. Appl. Artif. Intell..

[30]  Douglas L. Brutlag,et al.  Development and validation of a consistency based multiple structure alignment algorithm , 2006, Bioinform..

[31]  R. Solovay A model of set-theory in which every set of reals is Lebesgue measurable* , 1970 .

[32]  Yongchao Liu,et al.  MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities , 2010, Bioinform..

[33]  Cédric Notredame,et al.  Multiple sequence alignment modeling: methods and applications , 2016, Briefings Bioinform..

[34]  Ruhul A. Sarker,et al.  Vertical decomposition with Genetic Algorithm for Multiple Sequence Alignment , 2011, BMC Bioinformatics.

[35]  Eckart Zitzler,et al.  Indicator-Based Selection in Multiobjective Search , 2004, PPSN.

[36]  P. Hogeweg,et al.  The alignment of sets of sequences and the construction of phyletic trees: An integrated method , 2005, Journal of Molecular Evolution.

[37]  D. Bacon,et al.  Multiple Sequence Alignment , 1986, Journal of molecular biology.

[38]  Lothar Thiele,et al.  Multiobjective Optimization Using Evolutionary Algorithms - A Comparative Case Study , 1998, PPSN.

[39]  Miguel A. Vega-Rodríguez,et al.  Hybrid multiobjective artificial bee colony for multiple sequence alignment , 2016, Appl. Soft Comput..

[40]  Ari Löytynoja,et al.  An algorithm for progressive multiple alignment of sequences with insertions. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[41]  M. Kimura A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences , 1980, Journal of Molecular Evolution.

[42]  Lothar Thiele,et al.  Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach , 1999, IEEE Trans. Evol. Comput..

[43]  Desmond G. Higgins,et al.  Instability in progressive multiple sequence alignment algorithms , 2015, Algorithms for Molecular Biology.

[44]  Héctor Pomares,et al.  Optimizing multiple sequence alignments using a genetic algorithm based on three objectives: structural information, non-gaps percentage and totally conserved columns , 2013, Bioinform..

[45]  Qingfu Zhang,et al.  MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition , 2007, IEEE Transactions on Evolutionary Computation.

[46]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[47]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[48]  Gautam B. Singh Multiple Sequence Alignment , 2015 .

[49]  R. Doolittle,et al.  Progressive sequence alignment as a prerequisitetto correct phylogenetic trees , 2007, Journal of Molecular Evolution.