Cataloging and Mining Massive Datasets for Science Data Analysis

Abstract: Hardware advances in sensors, scientific instruments, and data storage have brought a flood of data that threatens to render traditional approaches to data analysis inadequate. The classic paradigm of a scientist manually and exhaustively going through a dataset is no longer feasible for many problems, ranging from remote sensing, astronomy, and atmospheric science to medicine, molecular biology, and biochemistry. This article presents our views as practitioners engaged in building computational systems to help scientists analyze and reduce massive datasets. We focus on what we see as the challenges and shortcomings of the current state of the art in data analysis, given the massive datasets still awaiting analysis. The presentation draws on recent and current scientific data analysis applications in astronomy, planetary sciences, solar physics, and atmospheric science that we have been involved with at the Jet Propulsion Laboratory (JPL).
