Biological Data Mining Using Bayesian Neural Networks: A Case Study

Biological data mining is the task of finding significant information in biomolecular data, such as motifs, clusters, genes, and protein signatures. This paper presents an example of biological data mining: the recognition of promoters in DNA. We propose a two-level ensemble of classifiers to recognize E. coli promoter sequences. The first level consists of three Bayesian neural networks, each trained on a different feature set. The outputs of the first-level classifiers are combined at the second level to give the final result. An empirical study shows that the approach achieves a precision of 92.2%, indicating that the proposed ensemble performs well on this task.
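The two-level structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the three feature sets nor the combination rule, so the first-level scorers here are simple hypothetical stand-ins (not Bayesian neural networks), and the second level is assumed to be a weighted average. The motifs TATAAT and TTGACA are the well-known E. coli -10 and -35 consensus sequences, used only for illustration. The `precision` helper shows the standard definition TP / (TP + FP).

```python
from typing import Callable, Sequence

# A first-level classifier maps a DNA sequence to a score in [0, 1].
Classifier = Callable[[str], float]


def at_content(seq: str) -> float:
    """Hypothetical stand-in scorer: fraction of A/T bases in the sequence."""
    return sum(base in "AT" for base in seq) / len(seq)


def best_match(seq: str, motif: str) -> float:
    """Hypothetical stand-in scorer: best fraction of positions matching
    `motif` over all windows of the sequence (0.0 if seq is too short)."""
    k = len(motif)
    return max(
        (sum(a == b for a, b in zip(seq[i:i + k], motif)) / k
         for i in range(len(seq) - k + 1)),
        default=0.0,
    )


def combine(classifiers: Sequence[Classifier],
            weights: Sequence[float],
            seq: str) -> float:
    """Second level (assumed rule): weighted average of first-level outputs."""
    total = sum(weights)
    return sum(w * clf(seq) for clf, w in zip(classifiers, weights)) / total


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    return tp / (tp + fp) if tp + fp else 0.0


# Three first-level scorers, each looking at a different (assumed) feature.
first_level = [
    at_content,
    lambda s: best_match(s, "TATAAT"),  # -10 consensus
    lambda s: best_match(s, "TTGACA"),  # -35 consensus
]

score = combine(first_level, [1.0, 1.0, 1.0], "TTGACATATAAT")
is_promoter = score >= 0.5  # the threshold is an arbitrary assumption
```

In an actual system each stand-in scorer would be replaced by a trained network that outputs a promoter probability for its own feature set, and the combination weights would be chosen on held-out data.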
