Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning

Many decision problems in science, engineering and economics are affected by uncertain parameters whose distribution is only indirectly observable through samples. The goal of data-driven decision-making is to learn, from finitely many training samples, a decision that will perform well on unseen test samples. This learning task is difficult even if all training and test samples are drawn from the same distribution, especially if the dimension of the uncertainty is large relative to the training sample size. Wasserstein distributionally robust optimization seeks data-driven decisions that perform well under the most adverse distribution within a prescribed Wasserstein distance from a nominal distribution constructed from the training samples. In this tutorial we argue that this approach has many conceptual and computational benefits. Most prominently, the optimal decisions can often be computed by solving tractable convex optimization problems, and they enjoy rigorous out-of-sample and asymptotic consistency guarantees. We also show that Wasserstein distributionally robust optimization has interesting ramifications for statistical learning and motivates new approaches for fundamental learning tasks such as classification, regression, maximum likelihood estimation and minimum mean square error estimation, among others.
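
For orientation, the verbal description above can be written as a min-max problem. The following is a sketch in one common notation, which the abstract itself does not fix: the decision x, feasible set X, loss function \ell, training samples \hat\xi_1, ..., \hat\xi_N, radius \varepsilon, order p and ground metric d are all assumed symbols, and the nominal distribution is taken here to be the empirical distribution on the training samples, which is one possible choice.

```latex
% Nominal distribution (assumed here to be the empirical distribution)
\hat{P}_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{\hat{\xi}_i}

% Wasserstein distance of order p \ge 1 with ground metric d,
% where \Pi(Q, P) is the set of couplings with marginals Q and P
W_p(Q, P) = \left( \inf_{\pi \in \Pi(Q, P)}
  \int d(\xi, \xi')^p \, \pi(\mathrm{d}\xi, \mathrm{d}\xi') \right)^{1/p}

% Distributionally robust decision problem: minimize the worst-case
% expected loss over all distributions within radius \varepsilon
\inf_{x \in X} \;
  \sup_{Q \,:\, W_p(Q, \hat{P}_N) \le \varepsilon} \;
  \mathbb{E}_{\xi \sim Q} \left[ \ell(x, \xi) \right]
```

Setting \varepsilon = 0 shrinks the ambiguity set to the nominal distribution alone, so the problem reduces to minimizing the expected loss under \hat{P}_N; larger values of \varepsilon hedge against a wider family of distributions.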
