Machine-assisted discovery of relationships in astronomy

High-volume, feature-rich data sets are becoming the bread and butter of 21st-century astronomy, but they present significant challenges to scientific discovery. In particular, identifying scientifically significant relationships between sets of parameters is non-trivial. Similar problems in the biological sciences and geosciences have led to the development of systems that can explore large parameter spaces and identify potentially interesting sets of associations. In this paper, we describe the application of automated relationship-discovery systems to astronomical data sets, focusing on an evolutionary programming technique and an information-theory technique. We demonstrate their use with two classical astronomical relationships: the Hertzsprung–Russell diagram and the Fundamental Plane of elliptical galaxies. We also show how they handle binary classification, a problem relevant to the next generation of large synoptic sky surveys, such as the Large Synoptic Survey Telescope (LSST). We find that results comparable to those of more familiar techniques, such as decision trees, are achievable. Finally, we consider the reality of the relationships discovered and how this can be used for feature selection and extraction.
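
As a rough illustration of what an information-theory technique looks for, the sketch below scores the dependence between two parameters with a plain histogram-based mutual information estimate. This is a simplified stand-in, not the specific method used in the paper; the function name `binned_mutual_information`, the bin count, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def binned_mutual_information(x, y, bins=16):
    """Plug-in mutual information estimate (in bits) from a 2-D histogram.

    Hypothetical helper for illustration only; the bin count is an
    arbitrary choice and the estimate is biased for small samples.
    """
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()            # joint probability per cell
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    nonzero = pxy > 0                      # skip empty cells to avoid log(0)
    ratio = pxy[nonzero] / (px @ py)[nonzero]
    return float(np.sum(pxy[nonzero] * np.log2(ratio)))


# Synthetic data (an assumption, not from the paper): one nonlinear
# relationship and one pair of unrelated parameters.
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-1.0, 1.0, n)
y_related = x**2 + rng.normal(0.0, 0.05, n)
y_unrelated = rng.uniform(-1.0, 1.0, n)

mi_related = binned_mutual_information(x, y_related)
mi_unrelated = binned_mutual_information(x, y_unrelated)
print(f"related:   {mi_related:.3f} bits")
print(f"unrelated: {mi_unrelated:.3f} bits")
```

The quadratic relationship is chosen because a linear correlation coefficient would largely miss it, whereas a mutual-information score does not depend on the functional form of the association.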
