Stochastic Optimization of Sorting Networks via Continuous Relaxations

Sorting input objects is an important step in many machine learning pipelines. However, the sorting operator is non-differentiable with respect to its inputs, which prohibits end-to-end gradient-based optimization. In this work, we propose NeuralSort, a general-purpose continuous relaxation of the output of the sorting operator from permutation matrices to the set of unimodal row-stochastic matrices, where every row sums to one and has a distinct arg max. This relaxation permits straight-through optimization of any computational graph involving a sorting operation. Further, we use this relaxation to enable gradient-based stochastic optimization over the combinatorially large space of permutations by deriving a reparameterized gradient estimator for the Plackett-Luce family of distributions over permutations. We demonstrate the usefulness of our framework on three tasks that require learning semantic orderings of high-dimensional objects, including a fully differentiable, parameterized extension of the k-nearest neighbors algorithm.
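
To make the relaxation in the abstract concrete, here is a minimal pure-Python sketch. It assumes the commonly cited form of the NeuralSort operator, in which row i of the relaxed matrix is softmax(((n + 1 - 2i) s - A_s 1) / tau), with A_s[j][k] = |s_j - s_k| and tau > 0 a temperature. The abstract itself does not state this formula, so treat it as an assumption. The function name `neural_sort` and the default temperature are illustrative and not taken from the authors' code.

```python
import math


def neural_sort(scores, tau=1.0):
    """Sketch of a relaxed (descending) sort matrix for a list of scores.

    Assumed formula: row i = softmax(((n + 1 - 2i) * s - A_s 1) / tau),
    where A_s[j][k] = |s_j - s_k|. Each returned row sums to one.
    """
    n = len(scores)
    # abs_diff_sums[j] = sum_k |s_j - s_k|, i.e. the vector A_s 1.
    abs_diff_sums = [sum(abs(sj - sk) for sk in scores) for sj in scores]
    rows = []
    for i in range(1, n + 1):
        logits = [((n + 1 - 2 * i) * s - a) / tau
                  for s, a in zip(scores, abs_diff_sums)]
        # Numerically stable softmax over the row.
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


scores = [0.3, 2.0, -1.0]
relaxed = neural_sort(scores, tau=0.1)
# Index of the largest entry in each row: which input lands in each rank.
ranks = [row.index(max(row)) for row in relaxed]
print(ranks)  # [1, 0, 2]: 2.0 first, then 0.3, then -1.0
```

Lowering `tau` pushes each row toward a one-hot vector, so the matrix approaches the permutation matrix of the descending sort; larger values give smoother rows and, in an autodiff framework, better-behaved gradients.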

Stefano Ermon | Eric Wang | Aditya Grover | Aaron Zweig
