Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis

Background: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While many comparative studies have compared different learning principles or different statistical models, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, possibly leading to questionable conclusions.

Results: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites.

Conclusions: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. We implement the derived prior in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis.
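
The Gaussian-like behaviour near the maximum and the Laplace-like behaviour in the tails can be illustrated in the simplest possible setting. The following is a minimal sketch under stated assumptions, not the paper's derivation and not the Jstacs implementation: it assumes a single two-outcome parameter θ with a Beta(a1, a2) prior, reparameterized as θ = sigmoid(b), so that the prior density in b (including the Jacobian of the transformation) is proportional to θ^a1 (1 − θ)^a2. The function names and the hyperparameter values a1 = a2 = 2 are illustrative choices.

```python
import math


def log_prior(b, a1=2.0, a2=2.0):
    """Unnormalized log-density in b of a Beta(a1, a2) prior on theta = sigmoid(b).

    Including the Jacobian d(theta)/db = theta * (1 - theta), the density is
    proportional to theta**a1 * (1 - theta)**a2, i.e.
    log p(b) = a1 * b - (a1 + a2) * log(1 + exp(b)) + const.
    """
    # Numerically stable log(1 + exp(b)).
    softplus = max(b, 0.0) + math.log1p(math.exp(-abs(b)))
    return a1 * b - (a1 + a2) * softplus


def second_difference(f, x, h=1e-3):
    """Central finite-difference estimate of the second derivative of f at x."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


# The mode in b is log(a1 / a2), which is 0 for the symmetric choice a1 = a2.
mode = math.log(2.0 / 2.0)

# Near the mode the log-density is approximately quadratic (Gaussian-like):
# analytically the curvature is -(a1 + a2) * theta * (1 - theta) at the mode.
curvature_at_mode = second_difference(log_prior, mode)

# Far from the mode the log-density is approximately linear (Laplace-like):
# the slope tends to -a2 on the right and to +a1 on the left.
slope_far_right = log_prior(31.0) - log_prior(30.0)
slope_far_left = log_prior(-30.0) - log_prior(-31.0)

print(curvature_at_mode, slope_far_right, slope_far_left)
```

In this toy case the constant negative curvature at the mode corresponds to a Gaussian-shaped peak, while the constant slopes far from the mode correspond to exponentially decaying, Laplace-like tails whose rates are set by the hyperparameters.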
