PALM: Probabilistic Area Loss Minimization for Protein Sequence Alignment

Protein sequence alignment is a fundamental problem in computational structural biology and is widely used for protein 3D structure prediction and protein homology detection. Most existing programs for detecting protein sequence alignments are based on the likelihood information of amino acids and are sensitive to alignment noise. We present PALM, a robust method for modeling pairwise protein structure alignments that uses the area distance to reduce the effect of biological measurement noise. PALM generatively learns the alignment of two protein sequences with a probabilistic area distance objective, which can denoise measurement errors and offsets introduced by different biologists. We show that learning is computationally efficient because the gradients are estimated by dynamically sampling alignments. Empirically, PALM generates sequence alignments with higher precision and recall, and with smaller area distance, than competing methods, especially for long protein sequences and remote homologies. These results suggest that PALM is a promising option for learning over large-scale protein sequence alignment problems.
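To make the notion of an area distance between two alignments concrete, the sketch below treats each pairwise alignment as a monotone path through the alignment grid and measures the area enclosed between two such paths. This is an illustrative assumption, not the paper's definition: the operation encoding (`M`, `X`, `Y`), the function names, and the anti-diagonal discretization are all hypothetical choices made for this example.

```python
from typing import List


def path_offsets(ops: str) -> List[int]:
    """Return the offset d = i - j of an alignment path at each anti-diagonal t = i + j.

    Hypothetical encoding used only in this sketch:
      'M' aligns one residue from each sequence (diagonal step),
      'X' consumes a residue of the first sequence only,
      'Y' consumes a residue of the second sequence only.
    """
    offsets = [0]
    d = 0
    for op in ops:
        if op == "M":
            # A diagonal step crosses two anti-diagonals without changing d.
            offsets.extend([d, d])
        elif op == "X":
            d += 1
            offsets.append(d)
        elif op == "Y":
            d -= 1
            offsets.append(d)
        else:
            raise ValueError(f"unknown operation {op!r}")
    return offsets


def area_distance(ops_a: str, ops_b: str) -> float:
    """Approximate the grid area enclosed between two alignment paths.

    Sums |d_a(t) - d_b(t)| over anti-diagonals and halves the result, because
    the change of coordinates (i, j) -> (i + j, i - j) doubles areas. The value
    is exact when the paths only meet or cross at sampled anti-diagonals and is
    an approximation otherwise.
    """
    a, b = path_offsets(ops_a), path_offsets(ops_b)
    if len(a) != len(b) or a[-1] != b[-1]:
        raise ValueError("both alignments must cover the same two sequences")
    return sum(abs(x - y) for x, y in zip(a, b)) / 2.0


if __name__ == "__main__":
    # Two alignments of a pair of length-2 sequences.
    print(area_distance("MM", "MM"))      # identical paths
    print(area_distance("MM", "XXYY"))    # diagonal vs. one corner of the grid
    print(area_distance("XXYY", "YYXX"))  # opposite corners of the grid
```

Under this encoding, identical alignments have distance zero, and the distance grows with how far one path drifts from the other, which is the kind of geometric penalty an area-based objective rewards minimizing.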
