Differentiable Ranking and Sorting using Optimal Transport

Sorting an array is a fundamental routine in machine learning, one that is used to compute rank-based statistics, cumulative distribution functions (CDFs), quantiles, or to select closest neighbors and labels. The sorting function is, however, piecewise constant (the sorting permutation of a vector does not change if the entries of that vector are infinitesimally perturbed) and therefore carries no gradient information to back-propagate. We propose a framework to sort elements that is algorithmically differentiable. We leverage the fact that sorting can be seen as a particular instance of the optimal transport (OT) problem on $\mathbb{R}$, from input values to a predefined array of sorted values (e.g., $1,2,\dots,n$ if the input array has $n$ elements). Building upon this link, we propose generalized CDF and quantile operators by varying the size and weights of the target presorted array. Because this amounts to using the so-called Kantorovich formulation of OT, we call these quantities K-sorts, K-CDFs and K-quantiles. We recover differentiable algorithms by adding an entropic regularization to the OT problem and approximating it with a few Sinkhorn iterations. We call the resulting operators S-sorts, S-CDFs and S-quantiles, and use them in various learning settings: we benchmark them against the recently proposed NeuralSort [Grover et al., 2019], propose applications to quantile regression, and introduce differentiable formulations of the top-k accuracy that deliver state-of-the-art performance.
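To make the construction concrete, recall the Kantorovich formulation in standard entropic-OT notation (as in [15]): sorting $n$ input values $x_1,\dots,x_n$ with weights $a$ against $m$ presorted target values $y_1 < \dots < y_m$ with weights $b$ amounts to solving

$$\min_{P \in U(a,b)} \langle P, C \rangle - \varepsilon H(P), \qquad U(a,b) = \{P \in \mathbb{R}_+^{n \times m} : P\mathbf{1}_m = a,\ P^\top \mathbf{1}_n = b\},$$

where $C_{ij}$ is the cost of transporting $x_i$ onto $y_j$ (e.g., $C_{ij} = (x_i - y_j)^2$) and $H(P) = -\sum_{ij} P_{ij}(\log P_{ij} - 1)$. For $\varepsilon > 0$ the solution is unique and differentiable in $x$; the unregularized case $\varepsilon = 0$ with $m = n$ and $a = b = \mathbf{1}_n / n$ recovers hard sorting.

The sketch below illustrates this recipe in plain NumPy: it runs Sinkhorn iterations on the entropic problem and reads soft ranks and soft sorted values off the resulting transport plan. The target grid, the quadratic cost, `epsilon`, and the particular read-outs are illustrative choices for exposition, not the paper's exact operators.

```python
import numpy as np

def sinkhorn_soft_sort(x, m=None, epsilon=1e-2, n_iters=200):
    """Differentiable (soft) ranking and sorting via entropy-regularized OT.

    Transports the n input values (uniform weights a) onto m presorted
    target values (uniform weights b), then reads soft ranks and soft
    sorted values off the regularized transport plan.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    m = n if m is None else m
    a = np.full(n, 1.0 / n)          # source weights
    b = np.full(m, 1.0 / m)          # target weights
    y = np.linspace(0.0, 1.0, m)     # presorted target grid

    # Rescale inputs to [0, 1] so that epsilon has a consistent scale.
    x_scaled = (x - x.min()) / (x.max() - x.min() + 1e-12)

    C = (x_scaled[:, None] - y[None, :]) ** 2  # quadratic ground cost
    K = np.exp(-C / epsilon)                   # Gibbs kernel

    # Sinkhorn fixed-point iterations on the scaling vectors (u, v).
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]            # regularized transport plan

    soft_ranks = (P @ np.cumsum(b)) / a        # in (0, 1], roughly rank / n
    soft_sorted = (P.T @ x) / b                # barycentric projection onto targets
    return soft_ranks, soft_sorted

ranks, sorted_vals = sinkhorn_soft_sort(np.array([0.3, -1.2, 2.5, 0.7]))
print(ranks)        # ~ [0.5, 0.25, 1.0, 0.75]  (i.e. ranks 2, 1, 4, 3 over n=4)
print(sorted_vals)  # ~ [-1.2, 0.3, 0.7, 2.5]
```

For small `epsilon` the outputs approach the hard ranks and the sorted vector; increasing `epsilon` smooths both, trading fidelity for better-behaved gradients.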

[1] Carlos Eduardo Scheidegger, et al. Certifying and Removing Disparate Impact, 2014, KDD.

[2] Tao Qin, et al. A general approximation framework for direct optimization of information retrieval measures, 2010, Information Retrieval.

[3] J. Lorenz, et al. On the scaling of multidimensional matrices, 1989.

[4] Yann Brenier, et al. Rearrangement, convection, convexity and entropy, 2013, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[5] Arnaud Doucet, et al. Fast Computation of Wasserstein Barycenters, 2013, ICML.

[6] Julien Rabin, et al. Wasserstein Barycenter and Its Application to Texture Mixing, 2011, SSVM.

[7] Filippo Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling, 2015.

[8] G. Lugosi, et al. Regularization, sparse recovery, and median-of-means tournaments, 2017, Bernoulli.

[9] Tommi S. Jaakkola, et al. Learning Population-Level Diffusions with Generative RNNs, 2016, ICML.

[10] Matthieu Lerasle, et al. Robust Machine Learning by Median-of-Means: Theory and Practice, 2019.

[11] Marco Cuturi, et al. Sinkhorn Distances: Lightspeed Computation of Optimal Transport, 2013, NIPS.

[12] Stephen E. Robertson, et al. SoftRank: optimizing non-smooth rank metrics, 2008, WSDM '08.

[13] Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.

[14] Robert E. Tarjan, et al. Dynamic trees as search trees via Euler tours, applied to the network simplex algorithm, 1997, Math. Program.

[15] Gabriel Peyré, et al. Computational Optimal Transport, 2018, Found. Trends Mach. Learn.

[16] John N. Tsitsiklis, et al. Introduction to Linear Optimization, 1997, Athena Scientific Optimization and Computation Series.

[17] Julie Delon, et al. Local Matching Indicators for Transport Problems with Concave Costs, 2011, SIAM J. Discret. Math.

[18] Andrew Zisserman, et al. Smooth Loss Functions for Deep Top-k Classification, 2018, ICLR.

[19] Tian Xia, et al. Direct 0-1 Loss Minimization and Margin Maximization with Boosting, 2013, NIPS.

[20] Kilian Q. Weinberger, et al. Distance Metric Learning for Large Margin Nearest Neighbor Classification, 2005, NIPS.

[21] Silvia Chiappa, et al. Wasserstein Fair Classification, 2019, UAI.

[22] A. Wilson, et al. Use of entropy maximizing models in theory of trip distribution, mode split and route split, 1969.

[23] Qiang Wu, et al. Learning to Rank Using an Ensemble of Lambda-Gradient Models, 2010, Yahoo! Learning to Rank Challenge.

[24] Jean-Philippe Vert, et al. Supervised Quantile Normalisation, 2017, arXiv.

[25] Gabriel Peyré, et al. Wasserstein barycentric coordinates, 2016, ACM Trans. Graph.

[26] Scott W. Linderman, et al. Learning Latent Permutations with Gumbel-Sinkhorn Networks, 2018, ICLR.

[27] Scott Sanner, et al. Algorithms for Direct 0-1 Loss Optimization in Binary Classification, 2013, ICML.

[28] A. Galichon, et al. Matching with Trade-Offs: Revealed Preferences Over Competing Characteristics, 2009, arXiv:2102.12811.

[29] Stefano Ermon, et al. Stochastic Optimization of Sorting Networks via Continuous Relaxations, 2019, ICLR.

[30] Alan L. Yuille, et al. The invisible hand algorithm: Solving the assignment problem with statistical physics, 1994, Neural Networks.

[31] Ryan P. Adams, et al. Ranking via Sinkhorn Propagation, 2011, arXiv.

[32] Stephen P. Boyd, et al. Accuracy at the Top, 2012, NIPS.

[33] Julien Rabin, et al. Sliced and Radon Wasserstein Barycenters of Measures, 2014, Journal of Mathematical Imaging and Vision.

[34] Yaniv Romano, et al. Conformalized Quantile Regression, 2019, NeurIPS.

[35] R. Koenker, et al. An interior point algorithm for nonlinear quantile regression, 1996.

[36] Bernhard Schmitzer, et al. Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems, 2016, SIAM J. Sci. Comput.

[37] Nicolas Courty, et al. Wasserstein discriminant analysis, 2016, Machine Learning.

[38] I. Barrodale, et al. An Improved Algorithm for Discrete $l_1$ Linear Approximation, 1973.

[39] Yang Zou, et al. Sliced Wasserstein Kernels for Probability Distributions, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Thomas Hofmann, et al. Learning to Rank with Nonsmooth Cost Functions, 2006, NIPS.

[41] P. Rousseeuw. Least Median of Squares Regression, 1984.