Group-Wise Principal Component Analysis for Exploratory Data Analysis

ABSTRACT In this article, we propose a new framework for matrix factorization based on principal component analysis (PCA) where sparsity is imposed. The structure to impose sparsity is defined in terms of groups of correlated variables found in correlation matrices or maps. The framework is based on three new contributions: an algorithm to identify the groups of variables in correlation maps, a visualization for the resulting groups, and a matrix factorization. Together with a method to compute correlation maps with minimum noise level, referred to as missing-data for exploratory data analysis (MEDA), these three contributions constitute a complete matrix factorization framework. Two real examples are used to illustrate the approach and compare it with PCA, sparse PCA, and structured sparse PCA. Supplementary materials for this article are available online.
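As a rough illustration of the idea in the abstract, the sketch below finds groups of correlated variables in a correlation map and then computes one sparse loading vector per group. It is a simplified stand-in under stated assumptions, not the algorithm proposed in the article: it uses a plain Pearson correlation matrix rather than a MEDA map, forms groups as connected components of a thresholded map, and applies no deflation between components. The function names `correlation_groups` and `groupwise_loadings` and the threshold value are hypothetical.

```python
import numpy as np


def correlation_groups(X, threshold=0.7):
    """Group variables linked by |correlation| >= threshold.

    Assumption: groups are connected components of the thresholded
    Pearson correlation map (a stand-in for the article's algorithm).
    Variables that end up alone are not reported as a group.
    """
    corr = np.corrcoef(X, rowvar=False)
    adjacent = np.abs(corr) >= threshold
    n_vars = adjacent.shape[0]
    visited = np.zeros(n_vars, dtype=bool)
    groups = []
    for start in range(n_vars):
        if visited[start]:
            continue
        visited[start] = True
        stack, members = [start], [start]
        while stack:  # depth-first search over the thresholded map
            i = stack.pop()
            for j in np.flatnonzero(adjacent[i] & ~visited):
                visited[j] = True
                stack.append(int(j))
                members.append(int(j))
        if len(members) > 1:
            groups.append(sorted(members))
    return groups


def groupwise_loadings(X, groups):
    """Return one unit-norm loading per group, zero outside the group.

    Each loading is the first principal direction of the mean-centered
    sub-block of X restricted to the group. Columns are ordered by the
    sum of squares captured. No deflation is applied (a simplification).
    """
    Xc = X - X.mean(axis=0)
    n_vars = X.shape[1]
    columns, captured = [], []
    for g in groups:
        _, s, vt = np.linalg.svd(Xc[:, g], full_matrices=False)
        p = np.zeros(n_vars)
        p[g] = vt[0]  # first right singular vector of the sub-block
        columns.append(p)
        captured.append(s[0] ** 2)
    if not columns:
        return np.zeros((n_vars, 0))
    order = np.argsort(captured)[::-1]
    return np.column_stack([columns[k] for k in order])
```

Calling `groupwise_loadings(X, correlation_groups(X))` on an observations-by-variables array yields a loading matrix in which every column is nonzero only for the variables of one group, which is the kind of sparsity structure the abstract describes.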
