ProbPFP: A Multiple Sequence Alignment Algorithm Combining Partition Function and Hidden Markov Model with Particle Swarm Optimization

The substitution score for pairwise sequence alignment is essential in conducting multiple sequence alignment (MSA). The hidden Markov model (HMM) and the partition function are two widely used methods for computing this score. Recent studies showed that alignment accuracy can be improved either by combining the partition function with an HMM or by optimizing the parameters of the HMM. However, these studies did not consider combining an optimized HMM with the partition function, which could greatly improve alignment accuracy. This study presents a new MSA algorithm, ProbPFP, that combines the partition function with an HMM optimized by particle swarm optimization (PSO). The parameters of the HMM are first optimized by the PSO algorithm, and the posterior probabilities derived from the HMM are then combined with those derived from the partition function to compute a comprehensive substitution score for alignment. To assess its effectiveness, ProbPFP was compared with 13 leading aligners: Probalign, CONTRAlign, ProbCons, MUSCLE, MAFFT, COBALT, T-Coffee, ClustalΩ, ClustalW, DIALIGN, PicXAA, Align-m and KALIGN2. ProbPFP achieved the highest average sum-of-pairs (SP) scores (0.9015 and 0.5984) and the highest average total column (TC) scores (0.8170 and 0.3956) on the benchmark sets OXBench and SABmark, respectively, as well as the second-highest average SP score (0.8250) and average TC score (0.6703) on the benchmark set BAliBASE. We also used the alignments generated by ProbPFP and 4 other leading aligners to rebuild the phylogenetic trees of 6 families from the TreeFam database. The results suggest that the trees built from ProbPFP alignments are closer to the reference trees than those built from the other aligners' alignments.
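
The SP and TC scores reported above can be illustrated with a minimal Python sketch. It assumes the common definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly in the test alignment. The function names, the gap character `-`, and the choice to score only reference columns with at least two residues are assumptions made for illustration; the scoring programs distributed with OXBench, SABmark and BAliBASE may treat gaps and unscored regions differently.

```python
from itertools import combinations


def column_residues(alignment):
    """For each column, map sequence index -> residue index (gaps are skipped)."""
    counters = [0] * len(alignment)
    columns = []
    for col in range(len(alignment[0])):
        residues = {}
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                residues[seq_idx] = counters[seq_idx]
                counters[seq_idx] += 1
        columns.append(residues)
    return columns


def aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for residues in columns:
        for (s1, r1), (s2, r2) in combinations(sorted(residues.items()), 2):
            pairs.add((s1, r1, s2, r2))
    return pairs


def sp_tc_scores(test, reference):
    """Return (SP, TC) of a test alignment measured against a reference."""
    test_cols = column_residues(test)
    ref_cols = column_residues(reference)

    # SP: share of reference residue pairs recovered by the test alignment.
    ref_pairs = aligned_pairs(ref_cols)
    test_pairs = aligned_pairs(test_cols)
    sp = len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0

    # TC: share of reference columns (assumed: >= 2 residues) reproduced exactly.
    test_col_set = {frozenset(c.items()) for c in test_cols}
    scored = [c for c in ref_cols if len(c) >= 2]
    hits = sum(frozenset(c.items()) in test_col_set for c in scored)
    tc = hits / len(scored) if scored else 0.0
    return sp, tc


if __name__ == "__main__":
    reference = ["AC-G", "ACTG"]  # toy alignments, not benchmark data
    test = ["ACG-", "ACTG"]
    print(sp_tc_scores(test, reference))
```

In this toy case the test alignment recovers two of the three reference residue pairs and two of the three scored reference columns, so both scores are 2/3.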

[22]  Yadong Wang,et al.  Identifying term relations cross different gene ontology categories , 2017, BMC Bioinformatics.

[23]  Meng Zhou,et al.  MetSigDis: a manually curated resource for the metabolic signatures of diseases , 2019, Briefings Bioinform..

[24]  Yue Jiang,et al.  DisSim: an online system for exploring significant similar diseases and exhibiting potential therapeutic drugs , 2016, Scientific Reports.

[25]  Jiajie Peng,et al.  Measuring phenotype-phenotype similarity through the interactome , 2017, BIBM.

[26]  Byung-Jun Yoon,et al.  PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences , 2010, Nucleic acids research.

[27]  Xiaojun Wu,et al.  Multiple sequence alignment using the Hidden Markov Model trained by an improved quantum-behaved particle swarm optimization , 2012, Inf. Sci..

[28]  Burkhard Morgenstern,et al.  DIALIGN at GOBICS—multiple sequence alignment using various sources of external information , 2013, Nucleic Acids Res..

[29]  Yadong Wang,et al.  A novel method to measure the semantic similarity of HPO terms , 2017, Int. J. Data Min. Bioinform..

[30]  Qinghua Guo,et al.  LncRNA2Target v2.0: a comprehensive database for target genes of lncRNAs in human and mouse , 2018, Nucleic Acids Res..

[31]  Olivier Poch,et al.  BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs , 1999, Bioinform..

[32]  Lode Wyns,et al.  Align-m-a new algorithm for multiple alignment of highly divergent sequences , 2004, Bioinform..

[33]  Riccardo Poli,et al.  Particle swarm optimization , 1995, Swarm Intelligence.

[34]  Yu Zhang,et al.  Multiple Sequence Alignment Based on Profile Hidden Markov Model and Quantum-Behaved Particle Swarm Optimization with Selection Method , 2011 .

[35]  D. Higgins,et al.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega , 2011, Molecular systems biology.

[36]  Liang Cheng,et al.  Measuring disease similarity and predicting disease-related ncRNAs by a novel method , 2017, BMC Medical Genomics.

[37]  E. Sonnhammer,et al.  Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features , 2008, Nucleic acids research.

[38]  Richa Agarwala,et al.  COBALT: constraint-based alignment tool for multiple protein sequences , 2007, Bioinform..

[39]  Jun Zhang,et al.  Identifying diseases-related metabolites using random walk , 2018, BMC Bioinformatics.

[40]  Liang Cheng,et al.  Rs4878104 contributes to Alzheimer’s disease risk and regulates DAPK1 gene expression , 2017, Neurological Sciences.

[41]  M. Ruggero,et al.  Similarity of Traveling-Wave Delays in the Hearing Organs of Humans and Other Tetrapods , 2007, Journal for the Association for Research in Otolaryngology.

[42]  Dennis R. Livesay,et al.  Probalign: multiple sequence alignment using partition function posterior probabilities , 2006, Bioinform..

[43]  Shuhui Liu,et al.  Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach , 2018, BMC Systems Biology.

[44]  Gajendra P. S. Raghava,et al.  OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy , 2003, BMC Bioinformatics.

[45]  K. Katoh,et al.  MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. , 2002, Nucleic acids research.

[46]  Liang Cheng,et al.  GAB2 rs2373115 variant contributes to Alzheimer's disease risk specifically in European population , 2017, Journal of the Neurological Sciences.