Using mixtures of common ancestors for estimating the probabilities of discrete events in biological sequences

Accurately estimating probabilities from observations is important for probabilistic-based approaches to problems in computational biology. In this paper we present a biologically-motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method is an extension of substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation and has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets. The method is applied to estimate amino acid probabilities based on observed counts in an alignment and is shown to perform comparably to previous methods. The method is also applied to estimate probability distributions over protein families and improves protein classification accuracy.

[1]  M. O. Dayhoff,et al.  Atlas of protein sequence and structure , 1965 .

[2]  M. Degroot Optimal Statistical Decisions , 1970 .

[3]  M. Gribskov,et al.  [9] Profile analysis , 1990 .

[4]  S. Henikoff,et al.  Automated assembly of protein blocks for database searching. , 1991, Nucleic acids research.

[5]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[6]  David Haussler,et al.  Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families , 1993, ISMB.

[7]  M. Gribskov,et al.  Profile Analysis , 1970 .

[8]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[9]  M. A. McClure,et al.  Hidden Markov models of biological primary sequence information. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[10]  S. Altschul,et al.  Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[11]  Jean-Michel Claverie,et al.  Some Useful Statistical Properties of Position-weight Matrices , 1994, Comput. Chem..

[12]  S. Henikoff,et al.  Position-based sequence weights. , 1994, Journal of molecular biology.

[13]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[14]  Frans M. J. Willems,et al.  The context-tree weighting method: basic properties , 1995, IEEE Trans. Inf. Theory.

[15]  K. Karplus REGULARIZERS FOR ESTIMATING DISTRIBUTIONS OF AMINO ACIDS FROM SMALL SAMPLES , 1995 .

[16]  David Haussler,et al.  Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology , 1996, Comput. Appl. Biosci..

[17]  S. Henikoff,et al.  Blocks database and its applications. , 1996, Methods in enzymology.

[18]  M. Gribskov,et al.  Identification of Sequence Patterns with Profile Analysis , 1996 .

[19]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[20]  Jorja G. Henikoff,et al.  Using substitution probabilities to improve position-specific scoring matrices , 1996, Comput. Appl. Biosci..

[21]  M. Gribskov,et al.  [13] Identification of sequence patterns with profile analysis , 1996 .

[22]  Yoav Freund,et al.  Predicting a binary sequence almost as well as the optimal biased coin , 2003, COLT '96.

[23]  R. Durbin,et al.  Pfam: A comprehensive database of protein domain families based on seed alignments , 1997, Proteins.

[24]  D. Haussler,et al.  MUTUAL INFORMATION, METRIC ENTROPY AND CUMULATIVE RELATIVE ENTROPY RISK , 1997 .

[25]  Durbin,et al.  Biological Sequence Analysis , 1998 .

[26]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[27]  D. Haussler,et al.  Worst Case Prediction over Sequences under Log Loss , 1999 .

[28]  Golan Yona,et al.  Modeling protein families using probabilistic suffix trees , 1999, RECOMB.

[29]  David Haussler,et al.  A Discriminative Framework for Detecting Remote Protein Homologies , 2000, J. Comput. Biol..

[30]  Protein Family Classification Using Sparse Markov Transducers , 2000, ISMB.

[31]  Alberto Apostolico,et al.  Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space , 2000, RECOMB '00.