Unsupervised Alignment of Embeddings with Wasserstein Procrustes

We consider the task of aligning two sets of points in high dimension, which has many applications in natural language processing and computer vision. For example, it was recently shown that a bilingual lexicon can be inferred without supervised data by aligning word embeddings trained on monolingual data. These recent advances rely on adversarial training to learn the mapping between the two embeddings. In this paper, we propose an alternative formulation based on the joint estimation of an orthogonal matrix and a permutation matrix. Because this problem is not convex, we initialize our optimization algorithm with a convex relaxation traditionally considered for the graph isomorphism problem. We also propose a stochastic algorithm to minimize our cost function on large-scale problems. Finally, we evaluate our method on unsupervised word translation, aligning word embeddings trained on monolingual data. On this task, our method obtains state-of-the-art results while requiring fewer computational resources than competing approaches.
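
The joint estimation described above can be illustrated with a small sketch. It is not the stochastic algorithm of the paper: it is a simplified full-batch alternation on synthetic data, assuming the cost is the squared Frobenius norm of XQ - PY with Q orthogonal and P a permutation. The function names (`procrustes`, `best_permutation`, `alternate`, `cost`), the synthetic data, and the noisy initialization `Q0` (a stand-in for the convex-relaxation initialization) are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def procrustes(X, Z):
    """Orthogonal Q minimizing ||X Q - Z||_F, in closed form via SVD."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt


def best_permutation(XQ, Y):
    """Permutation perm minimizing ||XQ - Y[perm]||_F.

    With Q orthogonal the norms are fixed, so this is equivalent to
    maximizing the sum of inner products <XQ[i], Y[perm[i]]>.
    """
    _, cols = linear_sum_assignment(-(XQ @ Y.T))
    return cols


def cost(X, Y, Q, perm):
    """Squared Frobenius alignment cost for a given (Q, perm)."""
    return float(np.sum((X @ Q - Y[perm]) ** 2))


def alternate(X, Y, Q0, n_iter=20):
    """Alternate exact updates of the permutation and of Q (hypothetical helper)."""
    Q = Q0
    for _ in range(n_iter):
        perm = best_permutation(X @ Q, Y)
        Q = procrustes(X, Y[perm])
    return Q, best_permutation(X @ Q, Y)


# Synthetic problem: Y is a rotated, row-shuffled copy of X.
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
Q_true, _ = np.linalg.qr(rng.standard_normal((d, d)))
perm_true = rng.permutation(n)
Y = np.empty_like(X)
Y[perm_true] = X @ Q_true  # so Y[perm_true[i]] equals X[i] @ Q_true

# Assumed initialization: nearest orthogonal matrix to a noisy copy of Q_true.
Q0 = procrustes(np.eye(d), Q_true + 0.1 * rng.standard_normal((d, d)))

Q, perm = alternate(X, Y, Q0)
print("initial cost:", cost(X, Y, Q0, best_permutation(X @ Q0, Y)))
print("final cost:  ", cost(X, Y, Q, perm))
```

Each step is an exact minimization over one variable, so the cost cannot increase, but the alternation can still stall in a poor local minimum. This is why the quality of the initial Q matters for a non-convex problem of this kind.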
