Feature Robust Optimal Transport for High-dimensional Data

Optimal transport (OT) is a machine learning problem with applications including distribution comparison, feature selection, and generative adversarial networks. In this paper, we propose feature robust optimal transport (FROT) for high-dimensional data, which jointly solves the feature selection and OT problems. Specifically, we formulate FROT as a min--max optimization problem. We then propose a convex formulation of FROT and solve it with a Frank--Wolfe-based optimization algorithm, in which the sub-problem can be solved efficiently using the Sinkhorn algorithm. A key advantage of FROT is that important features can be determined analytically by simply solving the convex optimization problem. Furthermore, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence. Through synthetic and benchmark experiments, we demonstrate that the proposed method can identify important features. We also show that FROT achieves state-of-the-art performance on real-world semantic correspondence datasets.
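The pipeline sketched in the abstract (a Frank--Wolfe outer loop whose linearized sub-problem is an entropy-regularized OT solved by Sinkhorn, with adversarial feature-group weights obtained via a softmax) can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction under assumptions, not the authors' released implementation: the helper names `sinkhorn` and `frot`, the squared-Euclidean per-group costs, the smoothing parameter `eta`, and the step size `2/(t+2)` are all illustrative choices.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=100):
    """Entropy-regularized OT plan between histograms a, b for cost C."""
    C = C / C.max()  # rescaling the cost does not change the OT minimizer
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def frot(X, Y, groups, eta=1.0, eps=0.1, n_outer=20):
    """Sketch of a FROT-style Frank-Wolfe loop over feature groups.

    X: (n, d) source points, Y: (m, d) target points,
    groups: list of feature-index lists partitioning the d dimensions.
    Returns a transport plan and softmax feature-group weights.
    """
    n, m = X.shape[0], Y.shape[0]
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    # One squared-Euclidean cost matrix per feature group.
    Cs = [((X[:, g][:, None, :] - Y[:, g][None, :, :]) ** 2).sum(-1)
          for g in groups]
    Pi = np.outer(a, b)  # feasible initial plan
    for t in range(n_outer):
        # Adversarial weights: softmax of per-group transport costs,
        # i.e. the gradient of eta * logsumexp(<Pi, C_l> / eta).
        scores = np.array([(Pi * C).sum() for C in Cs]) / eta
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        C_mix = sum(w * C for w, C in zip(alpha, Cs))
        # Linear-minimization oracle, solved approximately via Sinkhorn.
        S = sinkhorn(a, b, C_mix, eps=eps)
        gamma = 2.0 / (t + 2)  # standard Frank-Wolfe step size
        Pi = (1 - gamma) * Pi + gamma * S
    return Pi, alpha
```

The returned `alpha` is what the abstract refers to as the analytically determined feature importances: groups whose cost dominates the current plan receive larger softmax weight.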
