Guided Projections for Analyzing the Structure of High-Dimensional Data

ABSTRACT A powerful data transformation method named guided projections is proposed, creating new possibilities for revealing the group structure of high-dimensional data in the presence of noise variables. Projecting onto a space spanned by a small selection of observations makes it possible to measure the similarity of the remaining observations to the selection based on orthogonal and score distances. Observations are iteratively exchanged in the selection, creating a nonrandom sequence of projections, which we call guided projections. In contrast to conventional projection pursuit methods, which typically identify a single low-dimensional projection revealing some interesting features of the data, guided projections generate a series of projections that serve as a basis not just for diagnostic plots but also for directly investigating the group structure in the data. Based on simulated data, we identify the strengths and limitations of guided projections in comparison to commonly employed data transformation methods. We further show the relevance of the transformation by applying it to real-world datasets.
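The abstract does not give formulas for the two distances, so the following is only a minimal sketch of a single projection step under assumed, PCA-style definitions. The space is taken to be spanned by the centered selected observations. The score distance is taken as the standardized distance within that space, and the orthogonal distance as the Euclidean norm of the residual. The function name `projection_distances`, the tolerance `tol`, and the scaling by the selection's singular values are illustrative assumptions, not the authors' exact definitions. The rule for exchanging observations between steps is not described here and is therefore not shown.

```python
import numpy as np

def projection_distances(X, selection, tol=1e-10):
    """Score and orthogonal distances of every row of X relative to the
    space spanned by the centered rows X[selection] (assumed definitions)."""
    X = np.asarray(X, dtype=float)
    S = X[selection]
    center = S.mean(axis=0)

    # SVD of the centered selection: rows of Vt span the projection space.
    _, s, Vt = np.linalg.svd(S - center, full_matrices=False)
    keep = s > tol * s.max()          # drop numerically zero directions
    V = Vt[keep].T                    # shape (p, k)
    # Standard deviation of the selection along each retained direction.
    scale = s[keep] / np.sqrt(len(selection) - 1)

    Xc = X - center
    scores = Xc @ V                   # coordinates inside the space
    residual = Xc - scores @ V.T      # part orthogonal to the space

    score_dist = np.sqrt(((scores / scale) ** 2).sum(axis=1))
    orth_dist = np.linalg.norm(residual, axis=1)
    return score_dist, orth_dist

# Usage: 20 observations, 50 variables, first 5 observations selected.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
sd, od = projection_distances(X, [0, 1, 2, 3, 4])
```

Under these assumptions the selected observations have an orthogonal distance of numerically zero, because they lie in the space they span. Observations that resemble the selection should show small values for both distances.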
