Unsupervised feature selection using weighted principal components

Research highlights: An unsupervised feature selection method is proposed. The proposed method can successfully detect truly significant features. The method integrates PCA and control chart techniques.

Feature selection has received considerable attention in various areas as a way to select informative features and to simplify statistical models through dimensionality reduction. One of the most widely used methods for dimensionality reduction is principal component analysis (PCA). Despite its popularity, PCA suffers from a lack of interpretability with respect to the original features because the reduced dimensions are linear combinations of a large number of original features. Traditionally, two- or three-dimensional loading plots are used to identify important original features in the first few principal component dimensions. However, the interpretation of a loading plot is frequently subjective, particularly when large numbers of features are involved. In this study, we propose an unsupervised feature selection method that combines weighted principal components (PCs) with a thresholding algorithm. The weighted PC is obtained as the weighted sum of the first k PCs of interest. Each loading value in the weighted PC reflects the contribution of an individual feature. We also propose a thresholding algorithm that identifies the significant features. Our experimental results with both simulated and real datasets demonstrated the effectiveness of the proposed unsupervised feature selection method.
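The two steps described above (forming a weighted PC from the first k PCs, then thresholding its loadings) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the weights or the threshold rule, so the sketch assumes that each PC is weighted by its share of the variance explained by the first k PCs, that absolute loadings are combined, and that the threshold is the upper limit of a Shewhart individuals control chart estimated from the average moving range. The function names and the `n_sigma` parameter are hypothetical.

```python
import numpy as np


def weighted_pc_scores(X, k):
    """Return one weighted-PC loading per feature (array of length p).

    Assumptions (not stated in the abstract): features are standardized,
    absolute loadings are used, and PC i is weighted by its eigenvalue
    divided by the sum of the first k eigenvalues.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Standardize each feature; assumes no feature has zero variance.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the PC loading vectors; s**2 / (n - 1) are eigenvalues.
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    eigvals = s ** 2 / (n - 1)
    weights = eigvals[:k] / eigvals[:k].sum()
    # Weighted sum over the first k PCs of each feature's absolute loading.
    return np.abs(Vt[:k]).T @ weights


def control_limit_threshold(scores, n_sigma=3.0):
    """Upper control limit of an individuals chart (assumed threshold rule).

    Sigma is estimated as the average moving range divided by d2 = 1.128,
    the standard constant for moving ranges of span 2.
    """
    moving_range = np.abs(np.diff(scores))
    sigma_hat = moving_range.mean() / 1.128
    return scores.mean() + n_sigma * sigma_hat


def select_features(X, k, n_sigma=3.0):
    """Return indices of features whose weighted loading exceeds the limit."""
    scores = weighted_pc_scores(X, k)
    limit = control_limit_threshold(scores, n_sigma)
    return np.flatnonzero(scores > limit), scores, limit
```

Calling `select_features(X, k=2)` on an n-by-p data matrix returns the selected column indices together with the per-feature scores and the threshold, so the scores can also be plotted against the limit in the manner of a control chart.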
