The projection score - an evaluation criterion for variable subset selection in PCA visualization

Background

In many scientific domains, it is becoming increasingly common to collect high-dimensional data sets, often with an exploratory aim, to generate new and relevant hypotheses. The exploratory perspective often makes statistically guided visualization methods, such as Principal Component Analysis (PCA), the methods of choice. However, the clarity of the obtained visualizations, and thereby the potential to use them to formulate relevant hypotheses, may be compromised by the presence of many non-informative variables. For microarray data, more easily interpretable visualizations are often obtained by filtering the variable set, for example by removing the variables with the smallest variances or by including only the variables most strongly related to a specific response. The resulting visualization may depend heavily on the inclusion criterion, which in effect determines the number of retained variables. To our knowledge, no objective method exists for determining the optimal inclusion criterion in the context of visualization.

Results

We present the projection score, a straightforward, intuitively appealing measure of the informativeness of a variable subset with respect to PCA visualization. This measure can be applied to find suitable inclusion criteria for any type of variable filtering. We apply it to find optimal variable subsets for different filtering methods in both microarray data sets and synthetic data sets. We also note that the projection score can be applied in more general contexts, to compare the informativeness of any variable subsets with respect to visualization by PCA.

Conclusions

We conclude that the projection score provides an easily interpretable and universally applicable measure of the informativeness of a variable subset with respect to visualization by PCA, which can be used to systematically find the most interpretable PCA visualization in practical exploratory analysis.
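The variance filtering mentioned in the Background can be illustrated with a minimal sketch. The abstract does not define the projection score itself, so the code below does not attempt to compute it. It only shows how the retained-variable count changes a PCA summary, here the fraction of variance captured by the first two principal components. The function names `variance_filter` and `explained_fraction`, the synthetic two-group data, and the chosen subset sizes are all assumptions made for illustration.

```python
import numpy as np


def variance_filter(X, n_keep):
    """Return the column indices of the n_keep variables with the largest variance."""
    variances = X.var(axis=0, ddof=1)
    order = np.argsort(variances)[::-1]  # largest variance first
    return np.sort(order[:n_keep])


def explained_fraction(X, n_components=2):
    """Fraction of total variance captured by the first n_components PCs."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2                             # proportional to PC variances
    return var[:n_components].sum() / var.sum()


# Hypothetical synthetic data: 40 samples, 500 variables, of which the
# first 20 separate two groups and the rest are pure noise.
rng = np.random.default_rng(0)
n_samples, n_vars, n_informative = 40, 500, 20
X = rng.normal(size=(n_samples, n_vars))
groups = np.repeat([0, 1], n_samples // 2)
X[:, :n_informative] += 3.0 * groups[:, None]

# Compare the PCA summary across several inclusion thresholds.
results = {}
for n_keep in (10, 20, 50, 100, 500):
    idx = variance_filter(X, n_keep)
    results[n_keep] = explained_fraction(X[:, idx])
    print(n_keep, round(results[n_keep], 3))
```

The raw explained-variance fraction used here is not comparable across subsets of different sizes, since small subsets tend to score higher regardless of their content. That limitation is the kind of problem an objective criterion such as the projection score is meant to address.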
