Supervised projection pursuit – A dimensionality reduction technique optimized for probabilistic classification

Abstract An important step in multivariate analysis is dimensionality reduction, which allows for better classification and easier visualization of the class structures in the data. Techniques such as PCA, PLS-DA and LDA are most often used to explore patterns in the data and to reduce its dimensions. Yet these techniques do not always reveal the structures in the data properly. To this end, a supervised projection pursuit (SuPP) based on the Jensen-Shannon divergence is proposed in this article. Combining this metric with a powerful Monte Carlo-based optimization algorithm yielded a versatile dimensionality reduction technique capable of working with high-dimensional data and missing observations. Combined with a Naive Bayes (NB) classifier, SuPP proved to be a powerful preprocessing tool for classification. Namely, on the Iris data set, the prediction accuracy of SuPP-NB is significantly higher than that of PCA-NB (p-value ≤ 4.02E-05 in a 2D latent space, p-value ≤ 3.00E-03 in a 3D latent space) and significantly higher than that of PLS-DA (p-value ≤ 1.17E-05 in a 2D latent space and p-value ≤ 3.08E-03 in a 3D latent space). The significantly higher accuracy for this particular data set is strong evidence of better class separation in the latent spaces obtained with SuPP.
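To make the idea of a Jensen-Shannon-based projection index concrete, the following is a minimal sketch and not the authors' implementation. It assumes a histogram estimate of each class's distribution along a single projection direction and replaces the Monte Carlo-based optimizer with a plain random search over directions. The function names (`js_divergence`, `projection_index`, `random_search`), the bin count, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.

    Inputs are non-negative weights (e.g. histogram counts); they are
    normalized to sum to 1. With base 2 the result lies in [0, 1].
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping bins where a == 0
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def projection_index(X, y, w, bins=10):
    """Mean pairwise JS divergence of class histograms along direction w.

    Assumption: a shared-bin histogram stands in for whatever density
    estimate the original method uses.
    """
    w = np.asarray(w, dtype=float)
    z = X @ (w / np.linalg.norm(w))  # 1D projection of all samples
    edges = np.linspace(z.min(), z.max(), bins + 1)
    classes = np.unique(y)
    hists = [np.histogram(z[y == c], bins=edges)[0] for c in classes]
    scores = [
        js_divergence(hists[i], hists[j])
        for i in range(len(classes))
        for j in range(i + 1, len(classes))
    ]
    return float(np.mean(scores))


def random_search(X, y, n_trials=500, seed=0):
    """Keep the best of n_trials random directions (illustrative only)."""
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        score = projection_index(X, y, w)
        if score > best_score:
            best_w, best_score = w / np.linalg.norm(w), score
    return best_w, best_score


# Synthetic two-class data: the classes differ only along the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = np.repeat([0, 1], 100)
X[y == 1, 0] += 6.0

w_best, score_best = random_search(X, y)
print("best direction:", w_best, "index:", score_best)
```

A direction that separates the classes yields a higher index than one along which the class histograms overlap, which is the property a supervised projection pursuit exploits when searching for a latent space.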
