Statistical compression of protein sequences and inference of marginal probability landscapes over competing alignments using finite state models and Dirichlet priors

Abstract The information criterion of minimum message length (MML) provides a powerful statistical framework for inductive reasoning from observed data. We apply MML to the problem of protein sequence comparison using finite state models with Dirichlet distributions. The resulting framework allows us to supersede the ad hoc cost functions commonly used in the field by systematically addressing two problems: the arbitrariness of alignment parameters, and the disconnect between substitution scores and gap costs. Furthermore, our framework enables the generation of marginal probability landscapes over all possible alignment hypotheses. These landscapes can help users rationalize and assess competing alignment relationships between protein sequences simultaneously, going beyond the reporting of a single (best) alignment. We demonstrate the performance of our program on benchmarks containing distantly related protein sequences.

Availability and implementation The open-source program supporting this work is available from: http://lcb.infotech.monash.edu.au/seqmmligner.

Supplementary information Supplementary data are available at Bioinformatics online.
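To make the compression view in the abstract concrete, the following minimal sketch computes the code length, in bits, of a symbol sequence under a symmetric Dirichlet prior using an adaptive (sequential) code. It is an illustration only and is not the finite state alignment model of the paper: the function names, the two-letter toy alphabet, and the choice `alpha = 1.0` are all assumptions made for this example.

```python
import math
from collections import Counter


def adaptive_code_length_bits(seq, alphabet, alpha=1.0):
    """Code length (bits) of `seq` under a symmetric Dirichlet(alpha) prior.

    Hypothetical helper for illustration. Each symbol is encoded with the
    posterior predictive probability given the symbols seen so far, so no
    explicit parameter estimate has to be transmitted.
    """
    counts = Counter()
    k = len(alphabet)
    total_bits = 0.0
    for n, sym in enumerate(seq):
        # Predictive probability of `sym` after observing n symbols.
        p = (counts[sym] + alpha) / (n + alpha * k)
        total_bits -= math.log2(p)
        counts[sym] += 1
    return total_bits


def uniform_code_length_bits(seq, alphabet):
    """Code length (bits) under a fixed uniform model (a simple null model)."""
    return len(seq) * math.log2(len(alphabet))


if __name__ == "__main__":
    toy_alphabet = "AB"  # assumed toy alphabet, not the 20 amino acids
    skewed = "AAAA"
    print(adaptive_code_length_bits(skewed, toy_alphabet))
    print(uniform_code_length_bits(skewed, toy_alphabet))
```

A sequence with skewed composition costs fewer bits under the adaptive code than under the uniform null model. Comparing the message lengths of competing hypotheses in this way is the general idea behind MML-style inference; the model used in the paper is considerably richer than this sketch.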
