Stochastic Optimization of Sorting Networks via Continuous Relaxations

Sorting input objects is an important step in many machine learning pipelines. However, the sorting operator is non-differentiable with respect to its inputs, which prohibits end-to-end gradient-based optimization. In this work, we propose NeuralSort, a general-purpose continuous relaxation of the output of the sorting operator from permutation matrices to the set of unimodal row-stochastic matrices, where every row sums to one and has a distinct arg max. This relaxation permits straight-through optimization of any computational graph that involves a sorting operation. Further, we use this relaxation to enable gradient-based stochastic optimization over the combinatorially large space of permutations by deriving a reparameterized gradient estimator for the Plackett-Luce family of distributions over permutations. We demonstrate the usefulness of our framework on three tasks that require learning semantic orderings of high-dimensional objects, including a fully differentiable, parameterized extension of the k-nearest neighbors algorithm.
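
The abstract does not state the relaxation itself, so the following is only a minimal sketch under an assumption: that row i of the relaxed matrix is softmax(((n + 1 - 2i) s - A_s 1) / tau), with A_s[j][k] = |s_j - s_k| and temperature tau > 0, which is the form in which NeuralSort is usually stated. The function names `neural_sort` and `sample_relaxed_permutation` are hypothetical. The second function illustrates the reparameterization idea for Plackett-Luce distributions by adding Gumbel noise to the log-scores before relaxing the sort; it is written with the Python standard library only, so it shows the forward computation and not the gradient machinery.

```python
import math
import random


def neural_sort(scores, tau=1.0):
    """Relaxed permutation matrix for sorting `scores` in descending order.

    Assumed form: row i = softmax(((n + 1 - 2i) * s - A_s 1) / tau),
    where A_s[j][k] = |s_j - s_k|. Each row sums to one.
    """
    n = len(scores)
    # (A_s 1)[j] = sum over k of |s_j - s_k|
    abs_diff_sums = [sum(abs(sj - sk) for sk in scores) for sj in scores]
    rows = []
    for i in range(1, n + 1):
        logits = [((n + 1 - 2 * i) * scores[j] - abs_diff_sums[j]) / tau
                  for j in range(n)]
        # Numerically stable softmax over the row.
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def sample_relaxed_permutation(log_scores, tau=1.0, rng=random):
    """Perturb log-scores with Gumbel(0, 1) noise, then relax the sort.

    Sorting Gumbel-perturbed log-scores in descending order yields a
    Plackett-Luce sample; the noise does not depend on the scores.
    """
    perturbed = []
    for s in log_scores:
        # Clamp the uniform draw away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        perturbed.append(s - math.log(-math.log(u)))
    return neural_sort(perturbed, tau)


if __name__ == "__main__":
    for row in neural_sort([2.0, 5.0, 3.0], tau=0.1):
        print([round(p, 3) for p in row])
```

At a low temperature each row concentrates on a single entry, so taking the arg max of each row recovers the hard sort; larger values of tau give smoother rows.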
