A Diagnostic Procedure for High-Dimensional Data Streams via Missed Discovery Rate Control

Abstract: Monitoring complex systems involving high-dimensional data streams (HDS) provides quick real-time detection of abnormal changes in system performance, but accurate and efficient diagnosis of the streams responsible has also become increasingly important in many data-rich statistical process control applications. Existing diagnostic procedures, designed for low- or moderate-dimensional multivariate processes, may miss too much important information in the out-of-control streams when the signal-to-noise ratio (SNR) is low, or waste too many resources flagging in-control streams when the SNR is high. In addition, these procedures do not differentiate between streams according to their severity. In this article, we formulate the diagnosis problem for HDS as a multiple testing problem and provide a computationally fast diagnostic procedure that controls the weighted missed discovery rate (wMDR) at a prespecified level. The proposed procedure overcomes the limitations of conventional diagnostic procedures by controlling the wMDR while also minimizing the expected number of false positives. We show theoretically that the proposed procedure is asymptotically valid and optimal in a certain sense. Simulation studies and a real data analysis from a semiconductor manufacturing process show that the proposed procedure performs well in practice.
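
To make the two quantities named in the abstract concrete, the following is a minimal sketch that computes an empirical weighted missed discovery proportion and a false-positive count for a single diagnosis. The formula used here (missed out-of-control weight divided by total out-of-control weight) is an illustrative assumption, not the article's formal definition of the wMDR, and the function names, the 0/1 encoding of stream states and decisions, and the example weights are all hypothetical.

```python
import numpy as np

def weighted_missed_discovery_proportion(theta, delta, weights):
    """Assumed illustrative form: sum_i w_i*theta_i*(1-delta_i) / sum_i w_i*theta_i.

    theta[i]   = 1 if stream i is truly out of control, else 0
    delta[i]   = 1 if the procedure flags stream i, else 0
    weights[i] = severity weight of stream i (hypothetical values)
    """
    theta = np.asarray(theta, dtype=float)
    delta = np.asarray(delta, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = np.sum(weights * theta)
    if total == 0:
        # No out-of-control streams, so nothing can be missed.
        return 0.0
    missed = np.sum(weights * theta * (1.0 - delta))
    return float(missed / total)

def false_positive_count(theta, delta):
    """Number of in-control streams that were flagged."""
    theta = np.asarray(theta, dtype=int)
    delta = np.asarray(delta, dtype=int)
    return int(np.sum((1 - theta) * delta))

# Toy example with five streams (values are made up for illustration).
theta = np.array([1, 1, 0, 0, 1])            # true states
delta = np.array([1, 0, 1, 0, 1])            # diagnostic decisions
weights = np.array([3.0, 1.0, 1.0, 1.0, 2.0])

wmdp = weighted_missed_discovery_proportion(theta, delta, weights)
fp = false_positive_count(theta, delta)
print(wmdp, fp)
```

In this toy case only the second stream is missed, so the missed weight is 1 out of a total out-of-control weight of 6, and one in-control stream is flagged. A procedure of the kind described would choose the decisions so that the expected value of the first quantity stays below a target level while keeping the second as small as possible.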
