KDD for Science Data Analysis: Issues and Examples

Analyzing the massive data sets collected by scientific instruments requires automation as a prerequisite. There is an urgent need to create an intermediate level at which scientists can operate effectively, one that isolates them from the sheer size of the data and directs human analytical capabilities toward tasks in which machines do not even remotely approach humans: creative data analysis, theory and hypothesis formation, and drawing insights into underlying phenomena. We give an overview of the main issues in the exploitation of scientific data sets, present five case studies where KDD tools play important and enabling roles, and conclude with future challenges for data mining and KDD techniques in science data analysis.
