Using machine learning for discovery in synoptic survey imaging data

Modern time-domain surveys continuously monitor large swaths of the sky to look for astronomical variability. Astrophysical discovery in such data sets is complicated by the fact that detections of real transient and variable sources are vastly outnumbered by bogus detections caused by imperfect subtractions, atmospheric effects and detector artefacts. In this work we present a machine learning (ML) framework for discovery of variability in time-domain imaging surveys. Our ML methods provide probabilistic statements, in near real time, about the degree to which each newly observed source is an astrophysically relevant source of variable brightness. We provide details about each of the analysis steps involved, including compilation of the training and testing sets, construction of descriptive image-based and contextual features, and optimization of the feature subset and model tuning parameters. Using a validation set of nearly 30,000 objects from the Palomar Transient Factory, we demonstrate a missed detection rate of at most 7.7% at our chosen false-positive rate of 1% for an optimized ML classifier of 23 features, selected from an initial library of 42 attributes to avoid feature correlation and over-fitting. Importantly, we show that our classification methodology is insensitive to mis-labelled training data up to a contamination of nearly 10%, making it easier to compile sufficient training sets for accurate performance in future surveys. This ML framework, if adopted, should enable the maximization of scientific gain from future synoptic surveys and enable fast follow-up decisions on the vast amounts of streaming data produced by such experiments.
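The figure of merit quoted above, a missed detection rate at a fixed false-positive rate, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `missed_detection_rate`, the label convention (1 for real, 0 for bogus) and the toy scores are assumptions made for illustration only. The sketch picks the score threshold that lets through at most the allowed fraction of bogus detections, then reports the fraction of real sources that fall at or below it.

```python
def missed_detection_rate(scores, labels, max_fpr=0.01):
    """Return (threshold, missed_detection_rate, false_positive_rate).

    scores : classifier outputs, higher meaning "more likely real"
    labels : 1 for real sources, 0 for bogus detections (assumed convention)
    max_fpr: largest tolerated fraction of bogus detections accepted as real
    """
    bogus = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    real = [s for s, y in zip(scores, labels) if y == 1]
    if not bogus or not real:
        raise ValueError("both real and bogus examples are required")

    # Number of bogus detections allowed to score above the threshold.
    allowed = min(int(max_fpr * len(bogus)), len(bogus) - 1)

    # A source is accepted as real when its score is strictly greater than
    # the threshold, so at most `allowed` bogus scores can exceed it.
    threshold = bogus[allowed]

    false_pos = sum(1 for s in bogus if s > threshold)
    missed = sum(1 for s in real if s <= threshold)
    return threshold, missed / len(real), false_pos / len(bogus)


# Toy usage with made-up scores (not survey data).
toy_scores = [0.9, 0.8, 0.15, 0.7, 0.2, 0.1, 0.05, 0.6]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
thr, mdr, fpr = missed_detection_rate(toy_scores, toy_labels, max_fpr=0.25)
print(f"threshold={thr}, missed detection rate={mdr}, false-positive rate={fpr}")
```

Fixing the false-positive rate first and reading off the missed detection rate mirrors how the abstract states its result (1% and at most 7.7%, respectively); on a real validation set the same calculation would be run on the classifier's probabilistic outputs.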
