CLaSPS: A NEW METHODOLOGY FOR KNOWLEDGE EXTRACTION FROM COMPLEX ASTRONOMICAL DATA SETS

In this paper, we present the Clustering-Labels-Score Patterns Spotter (CLaSPS), a new methodology for determining correlations among astronomical observables in complex data sets, based on the application of distinct unsupervised clustering techniques. The novelty of CLaSPS is the criterion used to select the optimal clusterings: a quantitative measure of the degree of correlation between the cluster memberships and the distribution of a set of observables, the labels, that are not employed for the clustering. CLaSPS has been developed primarily as a tool to tackle the complexity of the massive multi-wavelength astronomical data sets produced by federating data from modern automated astronomical facilities. In this paper, we discuss the application of CLaSPS to two simple astronomical data sets, both composed of extragalactic sources with photometric observations at different wavelengths from large-area surveys. The first data set, CSC+, is composed of optical quasars spectroscopically selected in the Sloan Digital Sky Survey data, observed in the X-rays by Chandra, and with multi-wavelength observations in the near-infrared, optical, and ultraviolet spectral intervals. One of the results of the application of CLaSPS to CSC+ is the re-identification of a well-known correlation between the α_OX parameter and the near-ultraviolet color, in a subset of CSC+ sources with relatively small values of the near-ultraviolet colors. The other data set consists of a sample of blazars for which photometric observations in the optical, mid-infrared, and near-infrared are available, complemented, for a subset of the sources, by Fermi γ-ray data. The main results of the application of CLaSPS to this data set are the discovery of a strong correlation between the multi-wavelength color distribution of blazars and their optical spectral classification into BL Lac objects and flat-spectrum radio quasars, and of a peculiar pattern followed by blazars in the WISE mid-infrared color space. This pattern and its physical interpretation have been discussed in detail in other papers by one of the authors.
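
The selection criterion described above can be illustrated with a minimal sketch. This is not the CLaSPS score itself, which is not defined in this abstract; as a stand-in, the sketch uses Cramér's V, a standard measure of association between two categorical variables, computed between the cluster memberships and a binned label. The function names, the equal-width binning, and the toy data are all assumptions made for illustration. The candidate clusterings are taken as given and could come from any unsupervised method.

```python
from collections import Counter
from math import sqrt


def bin_labels(values, n_bins):
    """Assign each label value to one of n_bins equal-width bins (assumed binning)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def cramers_v(memberships, label_bins):
    """Cramér's V between two categorical sequences.

    Stand-in association measure, NOT the CLaSPS score:
    0 means independent, 1 means perfectly associated.
    """
    n = len(memberships)
    joint = Counter(zip(memberships, label_bins))
    rows = Counter(memberships)
    cols = Counter(label_bins)
    k = min(len(rows), len(cols))
    if k < 2:
        return 0.0
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def select_clustering(candidates, label_values, n_bins=2):
    """Score every candidate clustering against the label and return the best one."""
    bins = bin_labels(label_values, n_bins)
    scores = {name: cramers_v(m, bins) for name, m in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): one label value per source, not used for clustering.
labels = [0.1, 0.2, 0.15, 0.9, 0.8, 0.95, 0.12, 0.88]

# Two hypothetical clusterings of the same eight sources.
candidates = {
    "aligned": [0, 0, 0, 1, 1, 1, 0, 1],  # clusters track the label
    "mixed": [0, 1, 0, 1, 0, 1, 1, 0],    # clusters ignore the label
}

best, scores = select_clustering(candidates, labels)
print(best, scores)
```

In this toy case the clustering whose memberships follow the label distribution receives the higher score and is selected, which mirrors the idea of ranking clusterings by how strongly they correlate with observables held out of the clustering.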
