Sketching for Large-Scale Learning of Mixture Models

Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework in which model parameters are estimated from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass over the training set and is easily evaluated on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims to drastically reduce the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian mixture model (GMM), providing heuristics for the choice of the sketching procedure and theoretical guarantees of reconstruction. We show experimentally on synthetic data that the proposed algorithm yields results comparable to the classical Expectation-Maximization (EM) technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over 10^8 training samples) for the task of model-based speaker verification. Finally, we draw connections between the proposed framework and approximate Hilbert space embeddings of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive information-preservation guarantees, in the spirit of infinite-dimensional compressive sensing.
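To make the sketching step concrete, the snippet below accumulates a sketch as the empirical characteristic function of the data sampled at random frequencies, in a single pass over mini-batches. This is a minimal sketch under illustrative assumptions: the Gaussian frequency distribution and the function names (draw_frequencies, sketch_dataset) are ours, standing in for the heuristic frequency design discussed in the abstract.

    import numpy as np

    def draw_frequencies(d, m, scale=1.0, seed=None):
        # Draw m random frequency vectors in R^d. A Gaussian law is used
        # here for simplicity; the proposed framework designs a frequency
        # distribution adapted to GMMs.
        rng = np.random.default_rng(seed)
        return rng.normal(scale=scale, size=(m, d))

    def sketch_dataset(batches, Omega):
        # Accumulate the sketch z = (1/n) * sum_i exp(-i * Omega @ x_i),
        # i.e. the empirical characteristic function sampled at the rows
        # of Omega, in a single pass over mini-batches of the data.
        z = np.zeros(Omega.shape[0], dtype=np.complex128)
        n = 0
        for X in batches:                  # X has shape (batch_size, d)
            z += np.exp(-1j * (X @ Omega.T)).sum(axis=0)
            n += X.shape[0]
        return z / n

    # Toy usage: sketch 10^6 points from a 2-component mixture in R^2,
    # streamed 1000 points at a time; only the m-dim sketch is stored.
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        Omega = draw_frequencies(d=2, m=100, seed=1)
        batches = (rng.normal(size=(1000, 2))
                   + 4.0 * rng.integers(0, 2, (1000, 1))
                   for _ in range(1000))
        z = sketch_dataset(batches, Omega)
        print(z.shape)                     # (100,)

Because the sketch is a simple average of per-sample features, partial sketches computed on disjoint subsets (or on different machines) can be merged by a weighted sum, which is what makes the operator well suited to streams and distributed datasets.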
