Escaping the Curse of Dimensionality in Similarity Learning: Efficient Frank-Wolfe Algorithm and Generalization Bounds

Similarity and metric learning provide a principled approach to constructing a task-specific similarity from weakly supervised data. However, these methods are subject to the curse of dimensionality: as the number of features grows large, poor generalization is to be expected and training becomes intractable due to high computational and memory costs. In this paper, we propose a similarity learning method that can efficiently deal with high-dimensional sparse data. This is achieved by parameterizing similarity functions as convex combinations of sparse rank-one matrices and by using a greedy approximate Frank-Wolfe algorithm, which provides an efficient way to control the number of active features. We show that the convergence rate of the algorithm and its time and memory complexity are independent of the data dimension. We further provide a theoretical justification of our modeling choices through an analysis of the generalization error, which depends logarithmically on the sparsity of the solution rather than on the number of features. Our experiments on datasets with up to one million features demonstrate that our approach generalizes well despite the high dimensionality and that it outperforms several competing methods.
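The following is a minimal sketch of the idea described above, not the authors' implementation. It makes several assumptions that the abstract does not state: the similarity is bilinear, S_M(x, x') = x^T M x'; the rank-one bases have the form lam * (e_i ± e_j)(e_i ± e_j)^T; supervision comes as triplets (x, y, z) where x should be more similar to y than to z; and the loss is a squared hinge. All function names are hypothetical. The sketch uses dense NumPy arrays and scans every feature pair, so it illustrates only the Frank-Wolfe update and the growth of the active feature set, not the dimension-independent cost claimed in the abstract.

```python
import numpy as np

def loss_and_grad(M, X, Y, Z):
    """Squared-hinge triplet loss (assumed surrogate) and its gradient in M."""
    # Slack of each triplet: we want x^T M y >= x^T M z + 1.
    slack = np.maximum(0.0, 1.0 - np.sum((X @ M) * (Y - Z), axis=1))
    loss = np.mean(slack ** 2)
    # Gradient of slack^2 with respect to M is -2 * slack * x (y - z)^T.
    grad = -2.0 * (X * slack[:, None]).T @ (Y - Z) / len(X)
    return loss, grad

def best_basis(grad, lam):
    """Linear minimization step: basis lam * u u^T with the smallest <grad, basis>."""
    d = grad.shape[0]
    diag = np.diag(grad)
    base = diag[:, None] + diag[None, :]      # G_ii + G_jj
    cross = grad + grad.T                     # G_ij + G_ji
    rows, cols = np.triu_indices(d, k=1)      # all feature pairs with i < j
    pos = lam * (base + cross)[rows, cols]    # u = e_i + e_j
    neg = lam * (base - cross)[rows, cols]    # u = e_i - e_j
    k_pos, k_neg = np.argmin(pos), np.argmin(neg)
    if pos[k_pos] <= neg[k_neg]:
        i, j, sign = rows[k_pos], cols[k_pos], 1.0
    else:
        i, j, sign = rows[k_neg], cols[k_neg], -1.0
    u = np.zeros(d)
    u[i], u[j] = 1.0, sign
    return lam * np.outer(u, u), (int(i), int(j))

def frank_wolfe(X, Y, Z, lam=1.0, n_iters=10):
    """Each iteration adds one basis, hence at most two new active features."""
    d = X.shape[1]
    M = np.zeros((d, d))   # overwritten by the first basis, since gamma = 1 at k = 0
    active = set()
    for k in range(n_iters):
        _, grad = loss_and_grad(M, X, Y, Z)
        B, pair = best_basis(grad, lam)
        gamma = 2.0 / (k + 2.0)               # standard Frank-Wolfe step size
        M = (1.0 - gamma) * M + gamma * B     # iterate stays a convex combination
        active.update(pair)
    return M, active

# Toy usage on random data (illustrative only).
rng = np.random.default_rng(0)
X, Y, Z = (rng.random((50, 8)) for _ in range(3))
lam, n_iters = 1.0, 5
M, active = frank_wolfe(X, Y, Z, lam=lam, n_iters=n_iters)
```

Because every iterate is a convex combination of bases that each touch two features, stopping after k iterations bounds the number of active features by 2k. This is how early stopping can act as a sparsity control in a scheme of this kind.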
