Deep Neural Networks for Estimation and Inference

We study deep neural networks and their use in semiparametric inference. We establish novel rates of convergence for deep feedforward neural nets. Our new rates are sufficiently fast (in some cases minimax optimal) to allow us to establish valid second-step inference after first-step estimation with deep learning, a result also new to the literature. Our estimation rates and semiparametric inference results handle the current standard architecture: fully connected feedforward neural networks (multi-layer perceptrons), with the now-common rectified linear unit activation function and a depth explicitly diverging with the sample size. We discuss other architectures as well, including fixed-width, very deep networks. We establish nonasymptotic bounds for these deep nets for a general class of nonparametric regression-type loss functions, which includes as special cases least squares, logistic regression, and other generalized linear models. We then apply our theory to develop semiparametric inference, focusing on causal parameters for concreteness, such as treatment effects, expected welfare, and decomposition effects. Inference in many other semiparametric contexts can be readily obtained. We demonstrate the effectiveness of deep learning with a Monte Carlo analysis and an empirical application to direct mail marketing.
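As a concrete illustration of the architecture described above, the sketch below builds a fully connected feedforward ReLU network whose depth grows with the sample size and fits it by least squares. This is a minimal sketch in PyTorch, not the authors' implementation: the log(n) depth rule, the width of 64, the simulated data, and the optimizer settings are all illustrative assumptions, not the paper's prescriptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): a multi-layer
# perceptron with ReLU activations whose depth diverges with the sample size n,
# trained by least squares -- one member of the regression-type loss class the
# paper covers.
import math
import torch
import torch.nn as nn

def relu_mlp(d_in: int, width: int, depth: int) -> nn.Sequential:
    """Fully connected feedforward net: `depth` ReLU hidden layers, scalar output."""
    layers = [nn.Linear(d_in, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

# Simulated regression data; in the paper's applications, y would be an outcome
# and x covariates entering a first-step nonparametric regression.
n, d = 2000, 10
x = torch.randn(n, d)
y = torch.sin(x[:, :1]) + 0.1 * torch.randn(n, 1)

# Illustrative depth rule: depth grows like log(n), echoing the abstract's
# "depth explicitly diverging with the sample size" (the exact rate is assumed).
depth = max(2, int(math.log(n)))
net = relu_mlp(d, width=64, depth=depth)

# Squared-error loss; swapping in nn.BCEWithLogitsLoss() would give logistic
# regression, another special case of the paper's loss class.
loss_fn = nn.MSELoss()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
```

In the second step the abstract refers to, fitted first-step nets like this one would plug into an influence-function-based estimator. For the average treatment effect, for instance, the standard doubly robust score averages μ̂₁(x) − μ̂₀(x) + t(y − μ̂₁(x))/p̂(x) − (1 − t)(y − μ̂₀(x))/(1 − p̂(x)) over the sample, with the regression functions μ̂₀, μ̂₁ and the propensity score p̂ each estimated by deep nets.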
