Online updating method to correct for measurement error in big data streams

Abstract: When huge amounts of data arrive in streams, online updating is an important method for alleviating both computational and data storage issues. The scope of previous research on online updating is extended to the classical linear measurement error model. When some covariates are unknowingly measured with error at the beginning of the stream but are measured without error after a particular point along the stream, the updated estimators that ignore the measurement error are biased for the true parameters. A method is proposed to correct the bias of these estimators, as well as the bias of their variance estimators, once the covariates measured without error are first observed; after correction, the traditional online updating method can proceed as usual. Asymptotic distributions for the corrected and updated estimators are also established. Simulation studies and a real data analysis with an airline on-time dataset illustrate the performance of the proposed method.
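The setting described in the abstract can be illustrated with a minimal sketch. It is not the estimator proposed in the paper, whose details are not given here. It assumes a no-intercept simple linear regression with a zero-mean covariate, and it swaps in a textbook method-of-moments correction: the error variance is estimated by comparing the second moment of the covariate before and after the point where it becomes error-free. All names (`simulate_block`, `RunningStats`, `sigma_u2_hat`) and all simulation settings are hypothetical.

```python
import random


def simulate_block(n, beta, sigma_u, rng):
    """Simulate one block of the stream for y = beta * x + noise.

    Returns (w, y), where w = x + u is the recorded covariate and
    sigma_u is the measurement-error standard deviation (0 means none).
    """
    ws, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, sigma_u) if sigma_u > 0 else 0.0
        ws.append(x + u)
        ys.append(beta * x + rng.gauss(0.0, 0.5))
    return ws, ys


class RunningStats:
    """Cumulative sums that are updated block by block; raw data are discarded."""

    def __init__(self):
        self.n = 0
        self.sww = 0.0  # running sum of w^2
        self.swy = 0.0  # running sum of w * y

    def update(self, ws, ys):
        self.n += len(ws)
        self.sww += sum(w * w for w in ws)
        self.swy += sum(w * y for w, y in zip(ws, ys))


rng = random.Random(0)
beta = 2.0  # true slope (assumed for the simulation)

# Early part of the stream: covariate recorded with error (assumed sigma_u = 1).
early = RunningStats()
for _ in range(20):
    early.update(*simulate_block(1000, beta, 1.0, rng))

# Naive online estimate that ignores the error; it is attenuated toward zero.
naive = early.swy / early.sww

# First block in which the covariate is observed without error.
late = RunningStats()
late.update(*simulate_block(5000, beta, 0.0, rng))

# Assumption: x has the same distribution in both periods, so the excess
# second moment of w in the early period estimates the error variance.
sigma_u2_hat = early.sww / early.n - late.sww / late.n

# Remove the error contribution from the early sum of squares, then pool.
early_sww_corrected = early.sww - early.n * sigma_u2_hat
corrected = (early.swy + late.swy) / (early_sww_corrected + late.sww)

print(f"naive slope: {naive:.3f}, corrected slope: {corrected:.3f}, true: {beta}")
```

With equal covariate and error variances, as assumed here, the naive slope is attenuated by a factor of about one half, while the pooled estimate after correction should land near the true slope. Once the early sums are corrected, later blocks can simply be added to the same running sums, which mirrors the idea of resuming ordinary online updating after correction.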
