Tests and variables selection on regression analysis for massive datasets

According to Lindley's paradox, most point null hypotheses will be rejected when the sample size is very large. In this paper, a two-stage block testing procedure is proposed for regression analysis of massive data. New variable selection criteria, incorporated into the classical stepwise procedure, are also developed to select significant explanatory variables. Our approach is not only computationally simple for massive data, but a simulation study also confirms that it is more accurate, in the sense of achieving the nominal significance level, for huge data sets. A real example with a moderate sample size verifies that the proposed procedure is accurate compared with the classical method, and a huge real data set is also analyzed to demonstrate the selection of appropriate regressors.
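To make the block-wise idea concrete, the sketch below shows one possible reading of a two-stage block test for a single regression coefficient: the massive sample is partitioned into blocks of manageable size, an OLS fit and t-test are carried out within each block, and the block-level evidence is then aggregated in a second stage. The block size, the helper names `block_t_statistics` and `two_stage_block_test`, and the aggregation rule (averaging block t-statistics and rescaling) are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def block_t_statistics(X, y, coef_index, block_size):
    """Stage 1 (illustrative): split the data into blocks, fit OLS in each
    block, and return the per-block t-statistic of one coefficient."""
    n = len(y)
    t_stats = []
    for start in range(0, n, block_size):
        Xb, yb = X[start:start + block_size], y[start:start + block_size]
        nb, p = Xb.shape
        if nb <= p:               # skip blocks too small to fit the model
            continue
        beta, _, _, _ = np.linalg.lstsq(Xb, yb, rcond=None)
        resid = yb - Xb @ beta
        sigma2 = resid @ resid / (nb - p)            # residual variance
        cov = sigma2 * np.linalg.inv(Xb.T @ Xb)      # estimated Var(beta_hat)
        t_stats.append(beta[coef_index] / np.sqrt(cov[coef_index, coef_index]))
    return np.array(t_stats)

def two_stage_block_test(X, y, coef_index, block_size=5000):
    """Stage 2 (illustrative): combine the block-level statistics.  Averaging
    the block t-statistics and rescaling by sqrt(#blocks) is one plausible
    combination rule, shown here only for demonstration."""
    t_stats = block_t_statistics(X, y, coef_index, block_size)
    return t_stats.mean() * np.sqrt(len(t_stats))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)
    # Test H0: beta_2 = 0 (the truly null coefficient); the combined statistic
    # is approximately standard normal under H0 for large blocks.
    print(two_stage_block_test(X, y, coef_index=2))
```

Comparing the combined statistic with a standard normal quantile then gives an approximate level-α test in which no single block's enormous sample size dominates the decision, which is the motivation behind block-wise testing for massive data.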
