From Massive Data Sets to Science Catalogs: Applications and Challenges

With hardware advances in scientific instruments and data gathering techniques comes the inevitable flood of data that can render traditional approaches to science data analysis severely inadequate. The traditional approach of manual and exhaustive analysis of a data set is no longer feasible for many tasks ranging from remote sensing, astronomy, and atmospherics to medicine, molecular biology, and biochemistry. In this paper we present our views as practitioners engaged in building computational systems to help scientists deal with large data sets. We focus on what we view as challenges and shortcomings of the current state of the art in data analysis in view of the massive data sets that are still awaiting analysis. The presentation is grounded in applications in astronomy, planetary sciences, solar physics, and atmospherics that are currently driving much of our work at JPL.

Keywords: science data analysis, limitations of current methods, challenges for massive data sets, classification learning, clustering.
