Statistical Inference for Data Adaptive Target Parameters

Abstract: Suppose one observes n i.i.d. copies of a random variable with a probability distribution that is known to be an element of a particular statistical model. To define our statistical target, we partition the sample into V equal-size sub-samples and use this partitioning to define V splits into an estimation sample (one of the V sub-samples) and a corresponding complementary parameter-generating sample. For each of the V parameter-generating samples, we apply an algorithm that maps the sample to a statistical target parameter. We define our sample-split data adaptive statistical target parameter as the average of these V sample-specific target parameters. We present an estimator of this type of data adaptive target parameter, together with a corresponding central limit theorem. This general methodology for generating data adaptive target parameters is demonstrated with a number of practical examples that highlight new opportunities for statistical learning from data. The framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data. Given that more research is becoming "data-driven", the theory developed in this paper provides a new impetus for greater involvement of statistical inference in problems that are increasingly being addressed by clever, yet ad hoc, pattern-finding methods. To suggest this potential, and to verify the predictions of the theory, we present extensive simulation studies, along with a data analysis based on adaptively determined intervention rules, which give insight into how to structure such an approach. The results show that the data adaptive target parameter approach provides a general framework and resulting methodology for data-driven science.
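The sample-splitting scheme described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the selection rule `select_index`, the choice of target parameter (the cross-moment E[X_j Y] for the selected covariate j), the simulated data in `make_demo_data`, and the normal-based 95% interval built from the averaged within-fold variances are all hypothetical choices made here for concreteness.

```python
import math
import random
import statistics


def make_demo_data(n=200, p=3, seed=1):
    """Hypothetical simulated data: y depends on covariate 1 only."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(p)]
        y = 2.0 * x[1] + rng.gauss(0, 1)
        data.append((x, y))
    return data


def select_index(sample):
    """Hypothetical parameter-generating algorithm.

    Maps a sample to a target parameter by choosing the covariate j
    with the largest absolute empirical cross-moment with the outcome.
    """
    p = len(sample[0][0])
    scores = [abs(statistics.fmean(x[j] * y for x, y in sample))
              for j in range(p)]
    return max(range(p), key=scores.__getitem__)


def sample_split_estimate(data, V=5, seed=0):
    """Average V fold-specific estimates of fold-specific targets."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    # V sub-samples; they have equal size when len(data) is divisible by V.
    folds = [idx[v::V] for v in range(V)]

    estimates, variances, chosen = [], [], []
    for v in range(V):
        in_estimation = set(folds[v])
        estimation_sample = [data[i] for i in folds[v]]
        generating_sample = [data[i] for i in idx if i not in in_estimation]

        # The target is defined using the parameter-generating sample only.
        j = select_index(generating_sample)

        # The target is estimated using the estimation sample only.
        products = [x[j] * y for x, y in estimation_sample]
        estimates.append(statistics.fmean(products))
        variances.append(statistics.variance(products))
        chosen.append(j)

    psi = statistics.fmean(estimates)
    # Assumed variance estimate: average within-fold variance, scaled by n.
    se = math.sqrt(statistics.fmean(variances) / len(data))
    return psi, (psi - 1.96 * se, psi + 1.96 * se), chosen


if __name__ == "__main__":
    psi, ci, chosen = sample_split_estimate(make_demo_data())
    print("estimate:", psi)
    print("approximate 95% interval:", ci)
    print("covariate chosen in each split:", chosen)
```

The key design point is that each fold's target parameter is chosen without looking at that fold's estimation sample, so the fold-specific estimate is not biased by the selection step. The quantity being estimated is the average of the V fold-specific targets, which may differ across folds when the selection rule picks different covariates.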
