A Bayesian framework for combining gene predictions

MOTIVATION Gene identification and gene discovery in new genomic sequences is one of the most timely computational questions addressed by bioinformatics scientists. This computational research has resulted in several systems that have been used successfully in many whole-genome analysis projects. As the number of such systems grows the need for a rigorous way to combine the predictions becomes more essential. RESULTS In this paper we provide a Bayesian network framework for combining gene predictions from multiple systems. The framework allows us to treat the problem as combining the advice of multiple experts. Previous work in the area used relatively simple ideas such as majority voting. We introduce, for the first time, the use of hidden input/output Markov models for combining gene predictions. We apply the framework to the analysis of the Adh region in Drosophila that has been carefully studied in the context of gene finding and used as a basis for the GASP competition. The main challenge in combination of gene prediction programs is the fact that the systems are relying on similar features such as cod on usage and as a result the predictions are often correlated. We show that our approach is promising to improve the prediction accuracy and provides a systematic and flexible framework for incorporating multiple sources of evidence into gene prediction systems.

[1]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[2]  J. Mesirov,et al.  Hybrid system for protein secondary structure prediction. , 1992, Journal of molecular biology.

[3]  V. Solovyev,et al.  Ab initio gene finding in Drosophila genomic DNA. , 2000, Genome research.

[4]  Edward C. Uberbacher,et al.  GRAIL: a multi-agent neural network system for gene identification , 1996, Proc. IEEE.

[5]  M. Kanehisa,et al.  Prediction of splice junctions in mRNA sequences. , 1985, Nucleic acids research.

[6]  S. Salzberg,et al.  Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi , 1997, Nature.

[7]  K. Murakami,et al.  Gene recognition by combination of several gene-finding programs , 1998, Bioinform..

[8]  David H. Wolpert,et al.  Stacked generalization , 1992, Neural Networks.

[9]  C. Burge Chapter 8 – Modeling dependencies in pre-mRNA splicing signals , 1998 .

[10]  S. Lewis,et al.  Genome annotation assessment in Drosophila melanogaster. , 2000, Genome research.

[11]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[12]  A. Krogh,et al.  Using database matches with for HMMGene for automated gene detection in Drosophila. , 2000, Genome research.

[13]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems , 1988 .

[14]  S. Salzberg,et al.  Microbial gene identification using interpolated Markov models. , 1998, Nucleic acids research.

[15]  Mark Borodovsky,et al.  GENMARK: Parallel Gene Recognition for Both DNA Strands , 1993, Comput. Chem..

[16]  Simon Kasif,et al.  Modeling splice sites with Bayes networks , 2000, Bioinform..

[17]  D. Haussler,et al.  Genie--gene finding in Drosophila melanogaster. , 2000, Genome research.

[18]  Simon Kasif,et al.  Induction of Oblique Decision Trees , 1993, IJCAI.

[19]  Simon Kasif,et al.  Computational methods in molecular biology , 1998 .

[20]  Finn Verner Jensen,et al.  Introduction to Bayesian Networks , 2008, Innovations in Bayesian Networks.

[21]  David Haussler,et al.  A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA , 1996, ISMB.

[22]  Michael I. Jordan Learning in Graphical Models , 1999, NATO ASI Series.