A Randomized Subspace-based Approach for Dimensionality Reduction and Important Variable Selection

Analysis of high-dimensional data can offer a detailed description of a system but is often hampered by the curse of dimensionality. General dimensionality reduction techniques alleviate this difficulty by extracting a few important features, but the extracted features lack interpretability and a direct connection to decisions about the underlying physical variables. Important variable selection techniques, as an alternative, maintain interpretability, but they typically rely on a greedy search that can fail to capture important interactions. This research proposes a new method that produces subspaces (reduced-dimensional physical spaces) via a randomized search and forms an ensemble of models built on the critical subspaces. Applied to high-dimensional data collected from a composite metal development process, the proposed method demonstrates superior prediction and important variable selection.

Keywords: dimensionality reduction; important variable selection; subspace-based modeling
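The pipeline the abstract describes, drawing many random variable subsets, scoring a model on each, keeping the best-scoring ("critical") subspaces, and ensembling their predictions while reading variable importance off the kept subspaces, can be sketched as follows. This is a minimal illustration, not the paper's method: the linear least-squares base learner, the validation-R² score, the specific subspace size and counts, and the use of appearance frequency as the importance measure are all our assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: among 12 variables, only x0 and x3 (plus their
# interaction) drive the response. All specifics here are illustrative.
n, p = 300, 12
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 1.5 * X[:, 3] + X[:, 0] * X[:, 3] + 0.1 * rng.normal(size=n)

# Hold out a validation set for scoring candidate subspaces.
n_tr = 200
X_tr, y_tr = X[:n_tr], y[:n_tr]
X_va, y_va = X[n_tr:], y[n_tr:]

def fit_subspace(cols):
    """Least-squares fit on a variable subset; return validation R^2 and coefs."""
    A_tr = np.column_stack([X_tr[:, cols], np.ones(len(X_tr))])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    A_va = np.column_stack([X_va[:, cols], np.ones(len(X_va))])
    r2 = 1.0 - np.var(y_va - A_va @ coef) / np.var(y_va)
    return r2, coef

# Randomized search: draw many small subspaces and score each one.
n_draws, k = 200, 3
subspaces = [np.sort(rng.choice(p, size=k, replace=False)) for _ in range(n_draws)]
scores = [fit_subspace(cols)[0] for cols in subspaces]

# Keep the top-scoring (critical) subspaces; ensemble their predictions.
n_keep = 20
top = np.argsort(scores)[::-1][:n_keep]
preds = []
for i in top:
    _, coef = fit_subspace(subspaces[i])
    preds.append(np.column_stack([X_va[:, subspaces[i]], np.ones(len(X_va))]) @ coef)
ensemble = np.mean(preds, axis=0)
r2_ens = 1.0 - np.var(y_va - ensemble) / np.var(y_va)

# Importance: how often each variable appears in the critical subspaces.
importance = np.zeros(p)
for i in top:
    importance[subspaces[i]] += 1
```

Because the search samples whole subsets rather than adding one variable at a time, a subspace containing both interacting variables can outscore its single-variable counterparts, which is exactly the failure mode of greedy forward selection that the abstract points to.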
