Distributionally Robust Parametric Maximum Likelihood Estimation

We consider the problem of estimating the parameters of a probabilistic generative model specified by a natural exponential family of distributions. In this setting, the classical maximum likelihood estimator tends to overfit when the training sample is small, is sensitive to noise, and may perform poorly on downstream predictive tasks. To mitigate these issues, we propose a distributionally robust maximum likelihood estimator that minimizes the worst-case expected log-loss over a parametric Kullback-Leibler ball centered at a parametric nominal distribution. Leveraging the analytical expression of the Kullback-Leibler divergence between two distributions in the same natural exponential family, we show that the resulting min-max estimation problem is tractable in a broad setting that includes the robust training of generalized linear models. The proposed robust estimator is also statistically consistent and delivers promising empirical results in both regression and classification tasks.
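To make the tractability claim above concrete, the following LaTeX sketch records the standard closed-form Kullback-Leibler divergence between two members of the same natural exponential family (with natural parameters and log-partition function $A$) and a schematic reading of the min-max estimator described in the abstract. The symbols $A$, $T$, the radius $\rho$, and the nominal parameter $\hat{\theta}_0$ are assumed notation for illustration, not taken from the paper, and the displayed estimator is a sketch rather than the paper's exact formulation (e.g., the conditional GLM case may differ).

For a density $p_{\theta}(x) = h(x)\exp\!\big(\theta^{\top} T(x) - A(\theta)\big)$,
\[
\mathrm{KL}\big(\mathbb{P}_{\theta_1} \,\|\, \mathbb{P}_{\theta_2}\big)
  = A(\theta_2) - A(\theta_1) - \nabla A(\theta_1)^{\top}(\theta_2 - \theta_1),
\]
and a schematic version of the robust estimator reads
\[
\hat{\theta} \in \operatorname*{arg\,min}_{\theta} \;
  \sup_{\theta' :\, \mathrm{KL}(\mathbb{P}_{\theta'} \,\|\, \mathbb{P}_{\hat{\theta}_0}) \le \rho} \;
  \mathbb{E}_{X \sim \mathbb{P}_{\theta'}}\!\big[-\log p_{\theta}(X)\big].
\]

Because the divergence admits this closed form in terms of $A$ and $\nabla A$, the inner supremum becomes a smooth finite-dimensional problem over $\theta'$ rather than an optimization over arbitrary distributions, which is presumably the structural property the abstract leverages.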
