Theory of the GMM Kernel

In web search, data mining, and machine learning, two popular measures of data similarity are the cosine and the resemblance (the latter is for binary data). In this study, we develop theoretical results for both the cosine and the GMM (generalized min-max) kernel, which is a generalization of the resemblance. GMM has direct applications in machine learning as a positive definite kernel and can be efficiently linearized via probabilistic hashing to handle big data. Owing to its discrete nature, the hashed values can also be used to build hash tables for efficient near-neighbor search. We prove the theoretical limit of GMM and its consistency, assuming that the data follow an elliptical distribution, a general family that includes the multivariate normal and t-distributions as special cases. The consistency result holds as long as the data have a bounded first moment, an assumption that typically holds for data commonly encountered in practice. Furthermore, we establish the asymptotic normality of GMM. We also prove the limit of the cosine under elliptical distributions; in comparison, the consistency of GMM requires much weaker conditions. For example, when data follow a t-distribution with ν degrees of freedom, GMM typically provides a better estimate of similarity than the cosine when ν < 8 (at ν = 8 the distribution is already very close to normal). These theoretical results help explain the recent success of GMM and lay the foundation for further research.
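To make the object of study concrete, the GMM kernel for general real-valued vectors is computed by first expanding each coordinate into its positive and negative parts and then taking the min-max (weighted Jaccard) ratio of the expanded vectors. A minimal NumPy sketch (the function name `gmm_kernel` is our own; the expansion follows the standard generalized min-max construction):

```python
import numpy as np

def gmm_kernel(x, y):
    """Generalized min-max (GMM) similarity between two real-valued vectors.

    Each entry x_i is expanded into two nonnegative coordinates,
    (max(x_i, 0), max(-x_i, 0)); GMM is then the ratio of the sum of
    coordinate-wise minima to the sum of coordinate-wise maxima of the
    expanded vectors.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Expansion: positive parts and negative parts as separate coordinates.
    xe = np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)])
    ye = np.concatenate([np.maximum(y, 0.0), np.maximum(-y, 0.0)])
    return np.sum(np.minimum(xe, ye)) / np.sum(np.maximum(xe, ye))
```

On binary (0/1) data the expansion is a no-op and the formula reduces to the resemblance (Jaccard similarity), which is the sense in which GMM generalizes it.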
