Fast Differentiable Sorting and Ranking

Sorting is one of the most commonly used building blocks in computer programming. In machine learning, it is often used for robust statistics. However, seen as a function, it is piecewise linear and therefore has many kinks at which it is non-differentiable. More problematic is the related ranking operator, often used for order statistics and ranking metrics: it is a piecewise constant function, so its derivatives are zero or undefined. While numerous works have proposed differentiable proxies to sorting and ranking, they do not achieve the $O(n \log n)$ time complexity one would expect from sorting and ranking operations. In this paper, we propose the first differentiable sorting and ranking operators with $O(n \log n)$ time and $O(n)$ space complexity. In addition, our proposal enjoys exact computation and differentiation. We achieve this by constructing differentiable operators as projections onto the permutahedron, the convex hull of permutations, and by using a reduction to isotonic optimization. Empirically, we confirm that our approach is an order of magnitude faster than existing approaches and showcase two novel applications: a differentiable Spearman's rank correlation coefficient and least trimmed squares.
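The construction in the abstract can be illustrated with a small NumPy sketch. It assumes the squared-Euclidean (quadratic) form of the projection, with a hypothetical regularization strength `eps`; the function names (`isotonic_decreasing`, `project_permutahedron`, `soft_rank`, `soft_sort`) are illustrative and are not taken from the authors' code. The projection is reduced to one sort plus one isotonic regression solved by pool-adjacent-violators, so the sort dominates the cost.

```python
import numpy as np

def isotonic_decreasing(y):
    """Pool-adjacent-violators: argmin ||v - y||^2 s.t. v is non-increasing."""
    sums, counts = [], []  # one (sum, count) pair per pooled block
    for val in y:
        sums.append(float(val))
        counts.append(1)
        # Merge blocks while the previous block mean is below the current one,
        # since that would violate the non-increasing constraint.
        while len(sums) > 1 and sums[-2] / counts[-2] < sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

def project_permutahedron(z, w):
    """Euclidean projection of z onto the permutahedron generated by w.

    Assumes w is already sorted in descending order.
    """
    sigma = np.argsort(-z)            # permutation sorting z in descending order
    s = z[sigma]
    v = isotonic_decreasing(s - w)    # reduction to isotonic optimization
    out = np.empty_like(z, dtype=float)
    out[sigma] = s - v                # undo the sort
    return out

def soft_rank(theta, eps=1.0):
    """Soft ranks (rank 1 = largest value); eps is an assumed smoothing knob."""
    theta = np.asarray(theta, dtype=float)
    rho = np.arange(theta.size, 0, -1, dtype=float)   # (n, n-1, ..., 1)
    return project_permutahedron(-theta / eps, rho)

def soft_sort(theta, eps=1.0):
    """Soft descending sort of theta under the same assumptions."""
    theta = np.asarray(theta, dtype=float)
    rho = np.arange(theta.size, 0, -1, dtype=float)
    return project_permutahedron(rho / eps, np.sort(theta)[::-1])

theta = [2.9, 0.1, 1.2]
hard_like = soft_rank(theta, eps=0.01)    # small eps: close to the hard ranks
smoothed = soft_rank(theta, eps=100.0)    # large eps: ranks shrink toward the mean
```

With a small `eps` the outputs coincide with the hard ranks and the hard sort; as `eps` grows, the soft ranks collapse toward their mean while their sum stays fixed at $n(n+1)/2$. The Python loop here is for readability only; a practical implementation would vectorize or compile the pooling step and supply the corresponding Jacobian products.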
