Finding Genes by Case-Based Reasoning in the Presence of Noisy Case Boundaries

Effectively using previous cases requires that a reasoner first match, in some fashion, the current problem against a large library of stored cases. One largely unaddressed task in case-based reasoning is parsing continuous input into discrete cases. If this parsing is not done accurately, the relevant previous cases may not be found and the advantages of case-based problem solving will be lost. Parsing the data into cases is further complicated when the input data is noisy. This paper presents an approach to applying the case-based paradigm in the presence of noisy case boundaries. The approach has been fully implemented and applied in the domain of molecular biology; specifically, a successful case-based approach to gene finding is described. An empirical study demonstrates that the method is robust even with high error rates. The system is being used in conjunction with a Human Genome project in the Wisconsin Department of Genetics that is sequencing the DNA of the bacterium E. coli.
