A Projection Pursuit framework for supervised dimension reduction of high dimensional small sample datasets

The analysis and interpretation of datasets with a large number of features and few examples remains a challenging problem in the scientific community, owing to the difficulties associated with the curse of dimensionality. Projection Pursuit (PP) has shown promise in circumventing this phenomenon by searching for low-dimensional projections of the data in which meaningful structures are exposed. However, PP faces computational difficulties when dealing with datasets containing thousands of features (typical in genomics and proteomics) because of the vast number of parameters to optimize. In this paper we describe and evaluate a PP framework aimed at relieving these difficulties and thus easing the construction of classifier systems. The framework is a two-stage approach: the first stage performs a rapid compaction of the data, and the second stage implements the PP search using an improved version of the sequential projection pursuit (SPP) method (Guo et al., 2000, 32). In an experimental evaluation with eight public microarray datasets, we show that some configurations of the proposed framework clearly outperform eight well-established dimension reduction methods in their ability to pack more discriminatory information into fewer dimensions.
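The two-stage idea can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the abstract does not specify the compaction step, the projection index, or the optimizer. The sketch assumes a simple per-feature ranking for stage one, a two-class Fisher ratio as the projection index, and stochastic hill climbing in place of the improved SPP search. All function names (`compact`, `pp_search`, `fisher_index`, `make_toy`) and the toy data are hypothetical.

```python
import math
import random


def fisher_index(z, y):
    """Two-class Fisher ratio of 1-D values z with labels y in {0, 1} (assumed index)."""
    a = [v for v, c in zip(z, y) if c == 0]
    b = [v for v, c in zip(z, y) if c == 1]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / len(a)
    vb = sum((v - mb) ** 2 for v in b) / len(b)
    return (ma - mb) ** 2 / (va + vb + 1e-12)


def compact(X, y, k):
    """Stage 1 (assumption): keep the k features with the largest per-feature index."""
    p = len(X[0])
    scores = [fisher_index([row[j] for row in X], y) for j in range(p)]
    keep = sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]
    return [[row[j] for j in keep] for row in X], keep


def project(X, w):
    """Project every sample onto the direction w."""
    return [sum(wi * xi for wi, xi in zip(w, row)) for row in X]


def unit(w):
    """Rescale w to unit Euclidean norm."""
    n = math.sqrt(sum(v * v for v in w))
    return [v / n for v in w]


def pp_search(X, y, iters=500, step=0.3, seed=0):
    """Stage 2 (assumption): hill climbing over unit vectors, a stand-in for SPP."""
    rng = random.Random(seed)
    w = unit([rng.gauss(0, 1) for _ in range(len(X[0]))])
    best = fisher_index(project(X, w), y)
    for _ in range(iters):
        cand = unit([wi + step * rng.gauss(0, 1) for wi in w])
        val = fisher_index(project(X, cand), y)
        if val > best:  # accept only improving directions
            w, best = cand, val
    return w, best


def make_toy(n=20, p=200, seed=1):
    """Hypothetical small-sample data: 2 informative features among p."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        c = i % 2
        row = [rng.gauss(0, 1) for _ in range(p)]
        row[0] += 4.0 * c
        row[1] -= 4.0 * c
        X.append(row)
        y.append(c)
    return X, y


X, y = make_toy()
Xc, kept = compact(X, y, k=10)    # 200 features reduced to 10
w, index_value = pp_search(Xc, y)  # one discriminative direction in the compacted space
```

The point of the design is that the search in stage two optimizes 10 parameters rather than 200, which is the kind of relief the compaction stage is meant to provide. Finding several projection directions in sequence would require removing the structure already found before each new search, which this sketch omits.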

[1]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[2]  Li Bai,et al.  Face Verification Using Indirect Neighbourhood Components Analysis , 2010, ISVC.

[3]  Jun Guo,et al.  The small sample size problem of ICA: A comparative study and analysis , 2012, Pattern Recognit..

[4]  T. Golub,et al.  Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. , 2003, Cancer research.

[5]  Carlos Soares,et al.  Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time Results , 2003, Machine Learning.

[6]  Lipo Wang,et al.  A Modified T-test Feature Selection Method and Its Application on the HapMap Genotype Data , 2008, Genom. Proteom. Bioinform..

[7]  M. Aladjem Projection pursuit mixture density estimation , 2005, IEEE Transactions on Signal Processing.

[8]  Aapo Hyvärinen,et al.  Equivalence of Some Common Linear Feature Extraction Techniques for Appearance-based Object Recognition Tasks , 2022 .

[9]  Geoffrey E. Hinton,et al.  Neighbourhood Components Analysis , 2004, NIPS.

[10]  Javier Rojo,et al.  Dimension Reduction of microarray Gene Expression Data: the Accelerated Failure Time Model , 2009, J. Bioinform. Comput. Biol..

[11]  G. Nason Three‐Dimensional Projection Pursuit , 1995 .

[12]  Han-Ming Wu Kernel Sliced Inverse Regression with Applications to Classification , 2008 .

[13]  Robin Sibson,et al.  What is projection pursuit , 1987 .

[14]  Christos Faloutsos,et al.  On the 'Dimensionality Curse' and the 'Self-Similarity Blessing' , 2001, IEEE Trans. Knowl. Data Eng..

[15]  Wen Gao,et al.  Classifiability-Based Discriminatory Projection Pursuit , 2011, IEEE Transactions on Neural Networks.

[16]  C. Posse Projection pursuit exploratory data analysis , 1995 .

[17]  José A. Malpica,et al.  A projection pursuit algorithm for anomaly detection in hyperspectral imagery , 2008, Pattern Recognit..

[18]  E. Gehan,et al.  The properties of high-dimensional data spaces: implications for exploring gene and protein expression data , 2008, Nature Reviews Cancer.

[19]  Todd,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[20]  D. Botstein,et al.  Singular value decomposition for genome-wide expression data processing and modeling. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[21]  Roman Rosipal,et al.  Overview and Recent Advances in Partial Least Squares , 2005, SLSFS.

[22]  John D. Storey,et al.  Mapping gene expression quantitative trait loci by singular value decomposition and independent component analysis , 2008, BMC Bioinformatics.

[24]  D. Botstein,et al.  Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[25]  Hiroshi Motoda,et al.  Feature Extraction, Construction and Selection: A Data Mining Perspective , 1998 .

[26]  D. Massart,et al.  Sequential projection pursuit using genetic algorithms for data mining of analytical data. , 2000, Analytical chemistry.

[27]  Eric O. Postma,et al.  Dimensionality Reduction: A Comparative Review , 2008 .

[28]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[29]  Christian Posse,et al.  Projection Pursuit Indices Based on the Empirical Distribution Function , 2005 .

[30]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[31]  Soledad Espezua,et al.  Towards an efficient genetic algorithm optimizer for sequential projection pursuit , 2014, Neurocomputing.

[32]  Mancang Liu,et al.  Prediction of ozone tropospheric degradation rate constants by projection pursuit regression. , 2007, Analytica chimica acta.

[33]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007 .

[34]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[35]  Christopher J. C. Burges,et al.  Dimension Reduction: A Guided Tour , 2010, Found. Trends Mach. Learn..

[36]  Dianne Cook,et al.  Projection Pursuit for Exploratory Supervised Classification , 2005 .

[37]  E. Lander,et al.  Gene expression correlates of clinical prostate cancer behavior. , 2002, Cancer cell.

[38]  Dianne Cook,et al.  A projection pursuit index for large p small n data , 2010, Stat. Comput..

[39]  B. W. Wright,et al.  An improved optimization algorithm and a Bayes factor termination criterion for sequential projection pursuit , 2005 .

[40]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[41]  A. Buja,et al.  Projection Pursuit Indexes Based on Orthonormal Function Expansions , 1993 .

[42]  Alain Berro,et al.  An Efficient Optimization Method for Revealing Local Optima of Projection Pursuit Indices , 2010, ANTS Conference.

[43]  Igor Kononenko,et al.  Estimating Attributes: Analysis and Extensions of RELIEF , 1994, ECML.

[44]  Subhadip Basu,et al.  Text Line Segmentation for Unconstrained Handwritten Document Images Using Neighborhood Connected Component Analysis , 2009, PReMI.

[45]  J. Friedman,et al.  Projection Pursuit Regression , 1981 .

[46]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[47]  Luis Mateus Rocha,et al.  Singular value decomposition and principal component analysis , 2003 .

[48]  Jason F. Ralph,et al.  Automatic Induction of Projection Pursuit Indices , 2010, IEEE Transactions on Neural Networks.

[49]  Wlodzislaw Duch,et al.  Fast Projection Pursuit Based on Quality of Projected Clusters , 2011, ICANNGA.

[50]  S. Klinke,et al.  Exploratory Projection Pursuit , 1995 .

[51]  Vince D. Calhoun,et al.  A projection pursuit algorithm to classify individuals using fMRI data: Application to schizophrenia , 2008, NeuroImage.

[52]  Virginia Pascual,et al.  An Interferon-Inducible Neutrophil-Driven Blood Transcriptional Signature in Human Tuberculosis , 2010, Nature.

[53]  H.-P. Müller,et al.  Noise reduction in magnetocardiography by singular value decomposition and independent component analysis , 2006, Medical and Biological Engineering and Computing.

[54]  Yebin Liu,et al.  The small sample size problem of ICA , 2012 .

[55]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[56]  Debashis Ghosh,et al.  Eigengene-based linear discriminant model for tumor classification using gene expression microarray data , 2006, Bioinform..

[57]  Alain Berro,et al.  Genetic algorithms and particle swarm optimization for exploratory projection pursuit , 2010, Annals of Mathematics and Artificial Intelligence.

[58]  Heng Tao Shen,et al.  Principal Component Analysis , 2009, Encyclopedia of Biometrics.

[59]  Carlos Soares,et al.  Ranking Learning Algorithms , 2003 .

[60]  F. Prieto,et al.  Cluster Identification Using Projections , 2001 .

[61]  J. Kruskal TOWARD A PRACTICAL METHOD WHICH HELPS UNCOVER THE STRUCTURE OF A SET OF MULTIVARIATE OBSERVATIONS BY FINDING THE LINEAR TRANSFORMATION WHICH OPTIMIZES A NEW “INDEX OF CONDENSATION” , 1969 .

[62]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.

[63]  Wei Yang,et al.  Fast neighborhood component analysis , 2012, Neurocomputing.

[64]  David A. Landgrebe,et al.  Hyperspectral data analysis and supervised feature reduction via projection pursuit , 1999, IEEE Trans. Geosci. Remote. Sens..

[65]  C. Posse Tools for Two-Dimensional Exploratory Projection Pursuit , 1995 .

[66]  I. Johnstone,et al.  Statistical challenges of high-dimensional data , 2009, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[67]  D. Koller,et al.  From signatures to models: understanding cancer using microarrays , 2005, Nature Genetics.

[68]  Ian H. Witten,et al.  Weka: Practical machine learning tools and techniques with Java implementations , 1999 .

[69]  Lai-Wan Chan,et al.  Dimension reduction as a deflation method in ICA , 2006, IEEE Signal Process. Lett..

[70]  J. A. Branco,et al.  Projection-pursuit approach to robust linear discriminant analysis , 2010, J. Multivar. Anal..

[71]  Sushant Sachdeva,et al.  Dimension Reduction , 2008, Encyclopedia of GIS.

[72]  Shunjiu Wang,et al.  Projection Pursuit Dynamic Cluster Model and its Application to Water Resources Carrying Capacity Evaluation , 2010 .

[73]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[74]  Wlodzislaw Duch,et al.  Projection Pursuit Constructive Neural Networks Based on Quality of Projected Clusters , 2008, ICANN.

[75]  Wei Huang,et al.  Projection Pursuit Flood Disaster Classification Assessment Method Based on Multi-Swarm Cooperative Particle Swarm Optimization , 2011 .

[76]  Roslin Russell,et al.  Microarray Technology in Practice , 2008 .

[77]  E. Lander,et al.  MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia , 2002, Nature Genetics.

[78]  Jin Xu,et al.  Detecting single-feature polymorphisms using oligonucleotide arrays and robustified projection pursuit , 2005, Bioinform..

[79]  Wojtek J. Krzanowski,et al.  Projection Pursuit Clustering for Exploratory Data Analysis , 2003 .

[80]  G. Nason,et al.  Design and choice of projection indices , 1992 .

[81]  John W. Tukey,et al.  A Projection Pursuit Algorithm for Exploratory Data Analysis , 1974, IEEE Transactions on Computers.

[82]  Yuxiao Hu,et al.  Face recognition using Laplacianfaces , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.