Feature selection for high dimensional data in astronomy

Abstract As the volume of astronomical data grows exponentially, its complexity and dimensionality grow rapidly as well. Extracting information from such data has become a critical and challenging problem. Some algorithms, for example, can only be applied in low-dimensional spaces, which makes feature selection and feature extraction important topics. Here we describe the difference between feature selection and feature extraction, and introduce the taxonomy of feature selection methods together with the characteristics of each method. We present a case study comparing the performance and computational cost of different feature selection methods. For the filter method, ReliefF and the Fisher filter are adopted; for the wrapper method, improved CHAID, linear discriminant analysis (LDA), Naive Bayes (NB) and C4.5 are used as learners. The results on the sample indicate that, in terms of computational cost, the filter method is superior to the wrapper method. Moreover, different learning algorithms combined with appropriate feature selection methods may achieve better performance.
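The contrast between the filter and wrapper approaches can be illustrated with a minimal Python sketch. It is not the implementation used in the case study: the Fisher score below follows the common between-class over within-class scatter definition, the wrapper uses a nearest-centroid classifier as a stand-in learner (the study uses improved CHAID, LDA, NB and C4.5), and the toy data and all function names are hypothetical.

```python
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Filter: score each feature independently of any learner.

    Score = between-class scatter / within-class scatter (assumed definition).
    """
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall = mean(column)
        between = within = 0.0
        for c in sorted(set(y)):
            vals = [row[j] for row, label in zip(X, y) if label == c]
            between += len(vals) * (mean(vals) - overall) ** 2
            within += len(vals) * pvariance(vals)
        scores.append(between / within if within > 0 else float("inf"))
    return scores


def centroid_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid learner on a feature subset."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for c in sorted(set(y)):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == c]
            centroids[c] = [mean([r[j] for r in rows]) for j in features]
        point = [X[i][j] for j in features]
        pred = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        correct += int(pred == y[i])
    return correct / len(X)


def forward_wrapper(X, y, k):
    """Wrapper: greedy forward selection, re-evaluating the learner per candidate."""
    selected, remaining = [], list(range(len(X[0])))
    while len(selected) < k:
        best = max(remaining, key=lambda j: centroid_accuracy(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy sample: feature 0 separates the classes, feature 1 is noise,
# feature 2 is weakly informative.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 2.5],
    [0.8, 4.0, 1.5],
    [3.0, 4.5, 2.8],
    [3.2, 3.5, 3.0],
    [2.8, 5.0, 2.2],
]
y = [0, 0, 0, 1, 1, 1]

scores = fisher_scores(X, y)
filter_ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
wrapper_choice = forward_wrapper(X, y, 1)
print("filter ranking:", filter_ranking)
print("wrapper choice:", wrapper_choice)
```

The filter makes a single pass over the data per feature, whereas the wrapper re-evaluates the learner for every candidate subset. This is the source of the difference in computational cost between the two approaches.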
