Predicting the accuracy of multiple sequence alignment algorithms by using computational intelligent techniques

Multiple sequence alignments (MSAs) have become one of the most studied approaches in bioinformatics, serving as a basis for other important tasks such as structure prediction, biological function analysis and next-generation sequencing. However, current MSA algorithms do not always provide consistent solutions, since alignment becomes increasingly difficult for sequences with low similarity. It is well known that the performance of these algorithms depends directly on specific features of the sequences, which strongly influence alignment accuracy. Many MSA tools have been designed recently, but it is not possible to know in advance which one is the most suitable for a particular set of sequences. In this work, we analyze some of the most widely used algorithms in the literature and their dependence on several features. A novel intelligent algorithm based on least squares support vector machines (LS-SVM) is then developed to predict how accurate each alignment could be, depending on the analyzed features. The algorithm is applied to a dataset of 2180 MSAs. The proposed system first estimates the accuracy of the possible alignments. The most promising methodology is then selected to align each set of sequences. Since only one selected algorithm is run, the computational time is not excessively increased.
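The predict-then-select idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it fits one standard LS-SVM regressor per aligner by solving the usual LS-SVM linear system with an RBF kernel, then picks the aligner with the highest predicted accuracy. The function names, the hyperparameters `gamma` and `sigma`, and the layout of the feature vectors are hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    # Gaussian kernel between the rows of A and the rows of B.
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=100.0, sigma=0.5):
    # Standard LS-SVM regression system:
    #   [ 0   1^T         ] [ b     ]   [ 0 ]
    #   [ 1   K + I/gamma ] [ alpha ] = [ y ]
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(M, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b, dual weights alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma=0.5):
    # f(x) = sum_i alpha_i * k(x, x_i) + b
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

def train_model(X, y, gamma=100.0, sigma=0.5):
    # Hypothetical container: one fitted regressor per aligner.
    b, alpha = lssvm_fit(X, y, gamma, sigma)
    return {"X": X, "b": b, "alpha": alpha, "sigma": sigma}

def select_aligner(models, features):
    # Predict the accuracy of every aligner for one feature vector
    # and return the name with the highest predicted score.
    preds = {
        name: float(lssvm_predict(m["X"], m["b"], m["alpha"],
                                  features[None, :], m["sigma"])[0])
        for name, m in models.items()
    }
    return max(preds, key=preds.get), preds
```

In use, each entry of `models` would be trained on feature vectors extracted from sequence sets together with the accuracy that aligner achieved on them; only the aligner returned by `select_aligner` then needs to be run on a new set of sequences.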
