Machine learning in APOGEE: Unsupervised spectral classification with K-means

The data volume generated by astronomical surveys is growing rapidly. Traditional analysis techniques in spectroscopy either demand intensive human interaction or are computationally expensive. In this scenario, machine learning, and unsupervised clustering algorithms in particular, offer interesting alternatives. The Apache Point Observatory Galactic Evolution Experiment (APOGEE) offers a vast data set of near-infrared stellar spectra, which is well suited to testing such alternatives. Our aim is to apply an unsupervised classification scheme based on $K$-means to the massive APOGEE data set, and to explore whether the data are amenable to classification into discrete classes. We apply the $K$-means algorithm to 153,847 high-resolution spectra ($R\approx22,500$). We discuss the main virtues and weaknesses of the algorithm, as well as our choice of parameters. We show that a classification based on normalised spectra captures the variations in stellar atmospheric parameters, chemical abundances, and rotational velocity, among other factors. The algorithm is able to separate the bulge and halo populations, and to distinguish dwarfs, sub-giants, red clump (RC) stars, and red giant branch (RGB) stars. However, a discrete classification in flux space does not result in a neat organisation in parameter space. Furthermore, the lack of obvious groups in flux space makes the results fairly sensitive to the initialisation, and reduces the effectiveness of commonly used methods for selecting the optimal number of clusters. Our classification is publicly available, including extensive online material associated with the APOGEE Data Release 12 (DR12). Our description of the APOGEE database can greatly help with the identification of specific types of targets for various applications. We find a lack of obvious groups in flux space, and identify limitations of the $K$-means algorithm in dealing with this kind of data.
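The sensitivity to initialisation described above can be illustrated with a minimal sketch of standard $K$-means (Lloyd's algorithm) in pure Python. This is not the authors' pipeline: the toy "spectra", the helper names (`make_spectra`, `sqdist`, `kmeans`), and all parameter values are assumptions made for illustration only. The toy data are built with a continuously varying line depth, so there are no true groups in flux space and different random seeds may settle on different partitions.

```python
import random

def make_spectra(n_spectra=200, n_pix=20, seed=1):
    """Hypothetical toy data: flat continuum at 1.0 with one absorption
    line whose depth varies continuously (no discrete classes)."""
    rng = random.Random(seed)
    spectra = []
    for _ in range(n_spectra):
        depth = rng.uniform(0.0, 0.5)
        spec = [1.0 + rng.gauss(0.0, 0.01) for _ in range(n_pix)]
        for p in (9, 10, 11):          # pixels covered by the toy line
            spec[p] -= depth
        spectra.append(spec)
    return spectra

def sqdist(a, b):
    """Squared Euclidean distance between two spectra in flux space."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(spectra, k, seed, n_iter=50):
    """Lloyd's algorithm with k randomly chosen spectra as initial centres."""
    rng = random.Random(seed)
    centres = [list(s) for s in rng.sample(spectra, k)]
    labels = [-1] * len(spectra)
    for _ in range(n_iter):
        # Assignment step: each spectrum goes to its nearest centre.
        new_labels = [min(range(k), key=lambda j: sqdist(s, centres[j]))
                      for s in spectra]
        if new_labels == labels:       # converged: assignments are stable
            break
        labels = new_labels
        # Update step: each centre becomes the mean spectrum of its members.
        for j in range(k):
            members = [s for s, l in zip(spectra, labels) if l == j]
            if members:                # keep the old centre if a cluster empties
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    # Within-cluster sum of squared distances for the final partition.
    inertia = sum(sqdist(s, centres[l]) for s, l in zip(spectra, labels))
    return labels, centres, inertia

spectra = make_spectra()
for seed in (0, 1, 2):
    labels, centres, inertia = kmeans(spectra, k=3, seed=seed)
    sizes = sorted(labels.count(j) for j in range(3))
    print(f"seed={seed}  cluster sizes={sizes}  inertia={inertia:.4f}")
```

Comparing the cluster sizes and the within-cluster sum of squares across seeds gives a rough sense of how stable a partition is. With data that vary smoothly, as in this toy case, the runs need not agree, which is the kind of behaviour reported for the APOGEE spectra.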
