Consistent Semi-Supervised Graph Regularization for High Dimensional Data

Semi-supervised Laplacian regularization, a standard graph-based approach for learning from both labelled and unlabelled data, was recently demonstrated to have an insignificant high dimensional learning efficiency with respect to unlabelled data (Mai and Couillet 2018), causing it to be outperformed by its unsupervised counterpart, spectral clustering, given sufficient unlabelled data. Following a detailed discussion on the origin of this inconsistency problem, a novel regularization approach involving centering operation is proposed as solution, supported by both theoretical analysis and empirical results.

[1]  Shai Ben-David,et al.  Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning , 2008, COLT.

[2]  Fabrizio Angiulli,et al.  On the Behavior of Intrinsically High-Dimensional Spaces: Distances, Direct and Reverse Nearest Neighbors, and Hubness , 2017, J. Mach. Learn. Res..

[3]  Konstantin Avrachenkov,et al.  Generalized Optimization Framework for Graph-based Semi-supervised Learning , 2011, SDM.

[4]  Romain Couillet,et al.  Large Sample Covariance Matrices of Concentrated Vectors , 2018 .

[5]  Ulrike von Luxburg,et al.  Phase transition in the family of p-resistances , 2011, NIPS.

[6]  Mikhail Belkin,et al.  Using Manifold Stucture for Partially Labeled Classification , 2002, NIPS.

[7]  Marc Lelarge,et al.  Asymptotic Bayes Risk for Gaussian Mixture in a Semi-Supervised Setting , 2019, 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP).

[8]  Yann LeCun,et al.  The mnist database of handwritten digits , 2005 .

[9]  Eamonn J. Keogh Nearest Neighbor , 2010, Encyclopedia of Machine Learning.

[10]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[11]  Laurent Massoulié,et al.  A spectral method for community detection in moderately sparse degree-corrected stochastic block models , 2015, Advances in Applied Probability.

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  P. Bickel,et al.  On robust regression with high-dimensional predictors , 2013, Proceedings of the National Academy of Sciences.

[14]  Mikhail Belkin,et al.  Consistency of spectral clustering , 2008, 0804.0678.

[15]  Pascal Frossard,et al.  The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains , 2012, IEEE Signal Processing Magazine.

[16]  B. Nadler,et al.  Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data , 2009, NIPS 2009.

[17]  Romain Couillet,et al.  Concentration of Measure and Large Random Matrices with an application to Sample Covariance Matrices , 2018, 1805.08295.

[18]  Santo Fortunato,et al.  Community detection in graphs , 2009, ArXiv.

[19]  Mikhail Belkin,et al.  Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples , 2006, J. Mach. Learn. Res..

[20]  Fabio Gagliardi Cozman,et al.  Unlabeled Data Can Degrade Classification Performance of Generative Classifiers , 2002, FLAIRS.

[21]  R. Couillet,et al.  Spectral analysis of the Gram matrix of mixture models , 2015, 1510.03463.

[22]  O. Chapelle,et al.  Semi-Supervised Learning (Chapelle, O. et al., Eds.; 2006) [Book reviews] , 2009, IEEE Transactions on Neural Networks.

[23]  Zoubin Ghahramani,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[24]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[25]  R. Couillet,et al.  Random Matrix Methods for Wireless Communications: Estimation , 2011 .

[26]  Charu C. Aggarwal,et al.  On the Surprising Behavior of Distance Metrics in High Dimensional Spaces , 2001, ICDT.

[27]  J. W. Silverstein,et al.  Eigenvalues of large sample covariance matrices of spiked population models , 2004, math/0408165.

[28]  Michel Verleysen,et al.  The Concentration of Fractional Distances , 2007, IEEE Transactions on Knowledge and Data Engineering.

[29]  Mikhail Belkin,et al.  Semi-Supervised Learning on Riemannian Manifolds , 2004, Machine Learning.

[30]  Ahmed El Alaoui,et al.  Asymptotic behavior of \(\ell_p\)-based Laplacian regularization in semi-supervised learning , 2016, COLT.

[31]  David A. Landgrebe,et al.  The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon , 1994, IEEE Trans. Geosci. Remote. Sens..

[32]  Zoubin Ghahramani,et al.  Learning from labeled and unlabeled data with label propagation , 2002 .

[33]  Raj Rao Nadakuditi,et al.  The singular values and vectors of low rank perturbations of large rectangular random matrices , 2011, J. Multivar. Anal..

[34]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[35]  R. Couillet,et al.  Kernel spectral clustering of large dimensional data , 2015, 1510.03547.

[36]  J. Sherman,et al.  Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix , 1950 .

[37]  Mikhail Belkin,et al.  Semi-supervised Learning by Higher Order Regularization , 2011, AISTATS.

[38]  M. Ledoux The concentration of measure phenomenon , 2001 .

[39]  Romain Couillet,et al.  A random matrix analysis and improvement of semi-supervised learning for large dimensional data , 2017, J. Mach. Learn. Res..

[40]  Daniel A. Spielman,et al.  Algorithms for Lipschitz Learning on Graphs , 2015, COLT.

[41]  Amin Coja-Oghlan,et al.  Finding Planted Partitions in Random Graphs with General Degree Distributions , 2009, SIAM J. Discret. Math..

[42]  R. Cooke Real and Complex Analysis , 2011 .

[43]  Xiaojin Zhu,et al.  p-voltages: Laplacian Regularization for Semi-Supervised Learning on High-Dimensional Data , 2013 .

[44]  W. Rudin Real and complex analysis , 1968 .

[45]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..