Provable Algorithms for Machine Learning Problems

Modern machine learning algorithms can extract useful information from text, images, and videos. Many of these applications involve solving NP-hard problems on average-case inputs using heuristics. Theoretically analyzing these heuristics is often very challenging, and few rigorous results were known. What properties of the input allow such problems to be solved efficiently? This thesis takes a different approach: we identify natural properties of the input and then design new algorithms that provably work when the input has these properties. We obtain new, provable, and sometimes practical algorithms for learning tasks involving text corpora, images, and social networks.

The first part of the thesis presents new algorithms for learning the thematic structure of documents. We show that, under a reasonable assumption, it is possible to provably learn many topic models, including the well-known Latent Dirichlet Allocation. Ours is the first provable algorithm for topic modeling; an implementation runs 50 times faster than state-of-the-art MCMC implementations and produces comparable results.

The second part of the thesis provides ideas for provably learning deep, sparse representations. We start with sparse linear representations and give the first algorithm for the dictionary learning problem with provable guarantees. We then apply similar ideas to deep learning: under reasonable assumptions, our algorithms can learn a deep network built from denoising autoencoders.

The final part of the thesis develops a framework for learning latent variable models. We demonstrate how various latent variable models can be reduced to orthogonal tensor decomposition and then solved using the tensor power method. We give a tight perturbation analysis for the tensor power method, which reduces the number of samples required to learn many latent variable models.

In theory, the assumptions in this thesis help us understand why intractable problems in machine learning can often be solved in practice; in practice, the results suggest inherently new approaches for machine learning. We hope the assumptions and algorithms inspire new research problems and learning algorithms.
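
To make the last part concrete, the following is a minimal NumPy sketch (an illustration only, not the thesis's implementation) of the basic tensor power method with deflation for a symmetric third-order tensor T = sum_i lambda_i v_i^{(x)3} with orthonormal components v_i. The function names (`tensor_apply`, `tensor_power_iteration`, `decompose`) and parameters such as the number of restarts and iterations are hypothetical choices made for this sketch; the perturbation analysis discussed in the thesis is not reproduced here.

```python
import numpy as np

def tensor_apply(T, v):
    """Contract a symmetric 3rd-order tensor T with v twice: returns the vector T(I, v, v)."""
    return np.einsum('ijk,j,k->i', T, v, v)

def tensor_power_iteration(T, n_iters=100, n_restarts=10, rng=None):
    """Estimate one (eigenvalue, eigenvector) pair of T by power iteration.

    Runs several random restarts and keeps the candidate with the largest
    cubic form T(v, v, v), as in the standard robust power method.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = T.shape[0]
    best_val, best_vec = -np.inf, None
    for _ in range(n_restarts):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            v = tensor_apply(T, v)
            v /= np.linalg.norm(v)
        val = np.einsum('ijk,i,j,k->', T, v, v, v)
        if val > best_val:
            best_val, best_vec = val, v
    return best_val, best_vec

def decompose(T, rank, **kwargs):
    """Recover `rank` components of an (approximately) orthogonal tensor by deflation."""
    T = T.copy()
    lams, vecs = [], []
    for _ in range(rank):
        lam, v = tensor_power_iteration(T, **kwargs)
        lams.append(lam)
        vecs.append(v)
        # Deflate: subtract the recovered rank-one term lam * v (x) v (x) v.
        T -= lam * np.einsum('i,j,k->ijk', v, v, v)
    return np.array(lams), np.array(vecs)

if __name__ == '__main__':
    # Usage: build an orthogonal tensor from random orthonormal v_i and recover them.
    rng = np.random.default_rng(0)
    d, k = 8, 3
    V, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal columns v_1..v_k
    lam_true = np.array([3.0, 2.0, 1.0])
    T = sum(lam_true[i] * np.einsum('i,j,k->ijk', V[:, i], V[:, i], V[:, i]) for i in range(k))
    lams, vecs = decompose(T, rank=k, rng=rng)
    print(np.round(lams, 3))  # recovered eigenvalues, up to ordering
```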
