ON MACHINE-LEARNED CLASSIFICATION OF VARIABLE STARS WITH SPARSE AND NOISY TIME-SERIES DATA

With the coming data deluge from synoptic surveys, there is a need for frameworks that can quickly and automatically produce calibrated classification probabilities for newly observed variables based on small numbers of time-series measurements. In this paper, we introduce a methodology for variable-star classification, drawing from modern machine-learning techniques. We describe how to homogenize the information gleaned from light curves by selection and computation of real-numbered metrics (features), detail methods to robustly estimate periodic features, introduce tree-ensemble methods for accurate variable-star classification, and show how to rigorously evaluate a classifier using cross validation. On a 25-class data set of 1542 well-studied variable stars, we achieve a 22.8% error rate using the random forest (RF) classifier; this represents a 24% improvement over the best previous classifier on these data. This methodology is effective for identifying samples of specific science classes: for pulsational variables used in Milky Way tomography we obtain a discovery efficiency of 98.2% and for eclipsing systems we find an efficiency of 99.1%, both at 95% purity. The RF classifier is superior to other methods in terms of accuracy, speed, and relative immunity to irrelevant features; the RF can also be used to estimate the importance of each feature in classification. Additionally, we present the first astronomical use of hierarchical classification methods to incorporate a known class taxonomy in the classifier, which reduces the catastrophic error rate from 8% to 7.8%. Excluding low-amplitude sources, the overall error rate improves to 14%, with a catastrophic error rate of 3.5%.
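
The efficiency-at-purity figures quoted above can be illustrated with a minimal sketch. It assumes that purity is the fraction of selected sources that truly belong to the target class, and that efficiency is the fraction of all true class members that are selected; the function name `efficiency_at_purity`, its arguments, and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence, Tuple


def efficiency_at_purity(
    scores: Sequence[float],
    is_member: Sequence[bool],
    min_purity: float = 0.95,
) -> Tuple[float, Optional[float]]:
    """Return (efficiency, threshold) for the best cut meeting min_purity.

    scores     -- classifier probability that each source is in the class
    is_member  -- True where the source really belongs to the class
    min_purity -- required fraction of true members among selected sources

    A source is selected when its score is >= threshold. If no cut reaches
    the requested purity, the result is (0.0, None).
    """
    n_members = sum(1 for m in is_member if m)
    if n_members == 0:
        raise ValueError("no true class members in the sample")

    # Walk down the ranked list, treating each distinct score as a cut.
    ranked = sorted(zip(scores, is_member), key=lambda p: p[0], reverse=True)
    best_eff, best_thr = 0.0, None
    true_pos = 0
    for i, (score, member) in enumerate(ranked):
        true_pos += bool(member)
        # Tied scores are selected together, so evaluate only at the last tie.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        purity = true_pos / (i + 1)
        efficiency = true_pos / n_members
        if purity >= min_purity and efficiency > best_eff:
            best_eff, best_thr = efficiency, score
    return best_eff, best_thr


# Toy data (hypothetical): eight sources, four of which are true members.
scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.20]
members = [True, True, True, False, True, False, False, False]
print(efficiency_at_purity(scores, members, min_purity=0.95))
```

On this toy sample, demanding 95% purity keeps only the three highest-scoring sources, so one true member is lost; relaxing the purity requirement recovers it at the cost of admitting a contaminant.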
