Principal Component Analysis of Process Datasets with Missing Values

Datasets with missing values are common in the process industry, where they arise from causes such as sensor failure, inconsistent sampling rates, and the merging of data from different systems. Methods for handling missing data are typically applied during data pre-processing, but they can also be applied during model building. This article considers missing data in the context of principal component analysis (PCA), a method originally developed for complete data that is widely used in industry for multivariate statistical process control. Because missing data are prevalent and PCA has been successful on complete data, several PCA algorithms that can act on incomplete data have been proposed. Here, algorithms for applying PCA to datasets with missing values are reviewed. A case study demonstrates the performance of the algorithms, and recommendations are given on which algorithm is most appropriate for particular settings. An alternating algorithm based on the singular value decomposition achieved the best results in the majority of test cases involving process datasets.
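
To make the idea of an alternating, SVD-based approach concrete, the following is a minimal sketch of one common scheme: fill the missing cells, fit a low-rank model by truncated SVD, refill the missing cells from that model, and repeat. The abstract does not specify the article's algorithm, so this should be read as a generic illustration rather than the authors' method. The function name `svd_impute`, its parameters, and the column-mean starting guess are all assumptions made for this example.

```python
import numpy as np

def svd_impute(X, rank, n_iter=200, tol=1e-8):
    """Illustrative iterative SVD imputation (hypothetical helper, not from the article).

    X      : 2-D array with np.nan marking missing cells
    rank   : number of principal components to keep (assumed known here)
    Returns the completed matrix and the loading vectors (rank x n_columns).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumed starting guess: column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    for _ in range(n_iter):
        # Step 1: fit a mean-centred rank-`rank` model to the current completed data.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mean
        # Step 2: overwrite only the missing cells; observed cells stay fixed.
        updated = np.where(missing, approx, X)
        change = np.linalg.norm(updated - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = updated
        if change < tol:
            break
    return filled, Vt[:rank]
```

In this sketch the number of components is supplied by the caller; in practice it would have to be chosen by some selection procedure, and columns that are entirely missing are not handled.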
