Data mining a large digital sky survey: from the challenges to the scientific results

The analysis and efficient scientific exploration of the Digital Palomar Observatory Sky Survey represent a major technical challenge. The input data set consists of 3 Terabytes of pixel information and contains a few billion sources. We describe some of the specific scientific problems posed by the data, including searches for distant quasars and clusters of galaxies, and the data-mining techniques we are exploring to address them. Machine-assisted discovery methods may become essential for the analysis of such multi-Terabyte data sets. New and future approaches involve unsupervised classification and clustering analysis in the Giga-object data space, including various Bayesian techniques. In addition to searches for known types of objects in this database, these techniques may also offer the possibility of discovering previously unknown, rare types of astronomical objects.
