Toward Automated Discovery in the Biological Sciences

Knowledge discovery programs in the biological sciences require flexibility in the use of symbolic data and semantic information. Because of the volume of nonnumeric, as well as numeric, data, the programs must be able to explore a large space of possibly interesting relationships to discover those that are novel and interesting. Thus, the framework for the discovery program must facilitate proposing tasks, selecting the next task to perform, and performing the selected tasks. The framework we describe, called the agenda- and justification-based framework, has several properties that are desirable in semiautonomous discovery systems: it provides a mechanism for estimating the plausibility of tasks, it uses heuristics to propose and perform tasks, and it facilitates the encoding of general discovery strategies and the use of background knowledge. We have implemented the framework and our heuristics in a prototype program, HAMB, and have evaluated them in the domain of protein crystallization. Our results demonstrate that both the reasons given for performing tasks and the estimates of the interestingness of the concepts and hypotheses examined by HAMB contribute to its performance and that the program can discover novel, interesting relationships in biological data.
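
The propose, select, perform cycle described above can be illustrated with a minimal sketch of an agenda whose tasks carry reasons (justifications). This is not HAMB's implementation: the class names, the example reasons, the interestingness values, and above all the scoring rule (sum of reason strengths multiplied by the interestingness of the item the task operates on) are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A proposed discovery task with the reasons given for performing it."""
    name: str
    item: str  # concept or hypothesis the task operates on
    reasons: dict = field(default_factory=dict)  # reason text -> strength in [0, 1]

    def plausibility(self, interestingness):
        # Hypothetical scoring rule (an assumption, not HAMB's formula):
        # total reason strength scaled by the interestingness of the item.
        return sum(self.reasons.values()) * interestingness.get(self.item, 0.0)


class Agenda:
    """Holds proposed tasks and selects the most plausible one next."""

    def __init__(self, interestingness):
        self.interestingness = interestingness  # item -> estimate in [0, 1]
        self.tasks = {}

    def propose(self, name, item, reason, strength):
        # Proposing an existing task again adds a new reason to it;
        # repeating the same reason keeps only its largest strength.
        task = self.tasks.setdefault((name, item), Task(name, item))
        task.reasons[reason] = max(strength, task.reasons.get(reason, 0.0))
        return task

    def select_next(self):
        # Remove and return the task with the highest estimated plausibility.
        if not self.tasks:
            return None
        key = max(
            self.tasks,
            key=lambda k: self.tasks[k].plausibility(self.interestingness),
        )
        return self.tasks.pop(key)


# Illustrative values only; the attributes and numbers are made up.
interestingness = {"pH": 0.9, "temperature": 0.4}
agenda = Agenda(interestingness)
agenda.propose("examine-relationship", "pH", "attribute varies widely", 0.5)
agenda.propose("examine-relationship", "pH", "user marked attribute as relevant", 0.3)
agenda.propose("examine-relationship", "temperature", "attribute varies widely", 0.8)

chosen = agenda.select_next()
print(chosen.name, chosen.item, sorted(chosen.reasons))
```

In this sketch the task on `pH` is selected first: it has two distinct reasons and a more interesting item, even though the single reason for the `temperature` task is stronger. A heuristic that performs a task would typically propose follow-up tasks with their own reasons, which keeps the loop running.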
