Scientific Data Mining in Astronomy

We describe the application of data mining algorithms to research problems in astronomy. We posit that data mining has always been fundamental to astronomical research, since it is the basis of evidence-based discovery, including classification, clustering, and novelty discovery. These algorithms are a major set of computational tools for discovery in large databases, and they will become increasingly essential in the era of data-intensive astronomy. We review historical examples of data mining in astronomy, then discuss one of the largest data-producing projects anticipated for the coming decade: the Large Synoptic Survey Telescope (LSST). To facilitate data-driven discoveries in astronomy, we envision a new data-oriented research paradigm for astronomy and astrophysics: astroinformatics. We describe astroinformatics as both a research approach and an educational imperative for modern data-intensive astronomy. An important application area for large time-domain sky surveys (such as LSST) is the rapid identification, characterization, and classification of real-time sky events (including moving objects, photometrically variable objects, and the appearance of transients). We describe one possible implementation of a classification broker for such events, which incorporates several astroinformatics techniques: user annotation, semantic tagging, metadata markup, heterogeneous data integration, and distributed data mining. We also present examples of these types of collaborative classification and discovery approaches within other science disciplines.
