ISIT 2015 Tutorial: Information Theory and Machine Learning

We are in the midst of a data deluge, with an explosion in the volume and richness of data sets in elds including social networks, biology, natural language processing, and computer vision, among others. In all of these areas, machine learning has been extraordinarily successful in providing tools and practical algorithms for extracting information from massive data sets (e.g., genetics, multi-spectral imaging, Google and FaceBook). Despite this tremendous practical success, relatively less attention has been paid to fundamental limits and tradeos, and information theory has a crucial role to play in this context. The goal of this tutorial is to demonstrate how information-theoretic techniques and concepts can be brought to bear on machine learning problems in unorthodox and fruitful ways. We discuss how any learning problem can be formalized in a Shannon-theoretic sense, albeit one that involves non-traditional notions of codewords and channels. This perspective allows information-theoretic tools|including information measures, Fano’s inequality, random coding arguments, and so on|to be brought to bear on learning problems. We illustrate this broad perspective with discussions of several learning problems, including sparse approximation, dimensionality reduction, graph recovery, clustering, and community detection. We emphasise recent results establishing the fundamental limits of graphical model learning and community detection. We also discuss the distinction between the learning-theoretic capacity when arbitrary \decoding" algorithms are allowed, and notions of computationally-constrained capacity. Finally, a number of open problems and conjectures at the interface of information theory and machine learning will be discussed.

[1]  Amit Singer,et al.  Linear inverse problems on Erdős-Rényi graphs: Information-theoretic limits and efficient recovery , 2014, 2014 IEEE International Symposium on Information Theory.

[2]  S. Geer,et al.  High-dimensional additive modeling , 2008, 0806.4115.

[3]  Elchanan Mossel,et al.  Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms , 2007, SIAM J. Comput..

[4]  Luther Pfahler Eisenhart American Mathematical Society Colloquium Publications, Volume Viii , 1927 .

[5]  Emmanuel Abbe,et al.  Community Detection in General Stochastic Block models: Fundamental Limits and Efficient Algorithms for Recovery , 2015, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science.

[6]  Christian Borgs,et al.  Private Graphon Estimation for Sparse Graphs , 2015, NIPS.

[7]  M E J Newman,et al.  Community structure in social and biological networks , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[8]  J. Besag Statistical Analysis of Non-Lattice Data , 1975 .

[9]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10]  A. Pinkus n-Widths in Approximation Theory , 1985 .

[11]  S. Boorman,et al.  Social structure from multiple networks: I , 1976 .

[12]  Aidong Zhang,et al.  Cluster analysis for gene expression data: a survey , 2004, IEEE Transactions on Knowledge and Data Engineering.

[13]  Assaf Naor,et al.  Rigorous location of phase transitions in hard optimization problems , 2005, Nature.

[14]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[15]  Sujay Sanghavi,et al.  Clustering Sparse Graphs , 2012, NIPS.

[16]  M. Ledoux The concentration of measure phenomenon , 2001 .

[17]  Laurent Massoulié,et al.  Community Detection in the Labelled Stochastic Block Model , 2012, ArXiv.

[18]  D. Donoho For most large underdetermined systems of equations, the minimal 𝓁1‐norm near‐solution approximates the sparsest near‐solution , 2006 .

[19]  S. Geer Empirical Processes in M-Estimation , 2000 .

[20]  Kathryn B. Laskey,et al.  Stochastic blockmodels: First steps , 1983 .

[21]  Shlomo Shamai,et al.  Mutual information and minimum mean-square error in Gaussian channels , 2004, IEEE Transactions on Information Theory.

[22]  Y. Ritov,et al.  Persistence in high-dimensional linear predictor selection and the virtue of overparametrization , 2004 .

[23]  L. Birge Estimating a Density under Order Restrictions: Nonasymptotic Minimax Risk , 1987 .

[24]  Tim Roughgarden,et al.  Tight Error Bounds for Structured Prediction , 2014, ArXiv.

[25]  Mark E. J. Newman,et al.  Stochastic blockmodels and community structure in networks , 2010, Physical review. E, Statistical, nonlinear, and soft matter physics.

[26]  Leonidas J. Guibas,et al.  Near-Optimal Joint Object Matching via Convex Relaxation , 2014, ICML.

[27]  Alexandre B. Tsybakov,et al.  Introduction to Nonparametric Estimation , 2008, Springer series in statistics.

[28]  Lada A. Adamic,et al.  The political blogosphere and the 2004 U.S. election: divided they blog , 2005, LinkKDD '05.

[29]  S. A. Sherman,et al.  Providence , 1906 .

[30]  R. Tibshirani,et al.  Generalized Additive Models , 1991 .

[31]  Laurent Massoulié,et al.  Community detection thresholds and the weak Ramanujan property , 2013, STOC.

[32]  M. Wainwright,et al.  High-dimensional analysis of semidefinite relaxations for sparse principal components , 2008, 2008 IEEE International Symposium on Information Theory.

[33]  T. W. Anderson,et al.  An Introduction to Multivariate Statistical Analysis , 1959 .

[34]  Russell Impagliazzo,et al.  Hill-climbing finds random planted bisections , 2001, SODA '01.

[35]  Bin Yu,et al.  Spectral clustering and the high-dimensional stochastic blockmodel , 2010, 1007.1684.

[36]  Marion Kee,et al.  Analysis , 2004, Machine Translation.

[37]  Amin Coja-Oghlan,et al.  Graph Partitioning via Adaptive Spectral Techniques , 2009, Combinatorics, Probability and Computing.

[38]  Sundeep Rangan,et al.  Necessary and Sufficient Conditions for Sparsity Pattern Recovery , 2008, IEEE Transactions on Information Theory.

[39]  László Lovász,et al.  Large Networks and Graph Limits , 2012, Colloquium Publications.

[40]  Frank McSherry,et al.  Spectral partitioning of random graphs , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[41]  Elchanan Mossel,et al.  Belief propagation, robust reconstruction and optimal recovery of block models , 2013, COLT.

[42]  D. Eisenberg,et al.  Detecting protein function and protein-protein interactions from genome sequences. , 1999, Science.

[43]  Andrea Montanari,et al.  Which graphical models are difficult to learn? , 2009, NIPS.

[44]  Gábor Lugosi,et al.  Concentration Inequalities - A Nonasymptotic Theory of Independence , 2013, Concentration Inequalities.

[45]  Greg Linden,et al.  Amazon . com Recommendations Item-to-Item Collaborative Filtering , 2001 .

[46]  Y. Peres,et al.  Broadcasting on trees and the Ising model , 2000 .

[47]  Vincent Y. F. Tan,et al.  High-dimensional structure estimation in Ising models: Local separation criterion , 2011, 1107.1736.

[48]  Andrea Montanari,et al.  Finding One Community in a Sparse Graph , 2015, Journal of Statistical Physics.

[49]  Edoardo M. Airoldi,et al.  Stochastic blockmodels with growing number of classes , 2010, Biometrika.

[50]  V. Koltchinskii,et al.  SPARSITY IN MULTIPLE KERNEL LEARNING , 2010, 1211.2998.

[51]  D. Donoho,et al.  Geometrizing Rates of Convergence, III , 1991 .

[52]  P. Spirtes,et al.  Causation, prediction, and search , 1993 .

[53]  Olgica Milenkovic,et al.  Correlation Clustering with Constrained Cluster Sizes and Extended Weights Bounds , 2014, SIAM J. Optim..

[54]  Cristopher Moore,et al.  Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications , 2011, Physical review. E, Statistical, nonlinear, and soft matter physics.

[55]  Andrea Montanari,et al.  Conditional Random Fields, Planted Constraint Satisfaction and Entropy Concentration , 2013, APPROX-RANDOM.

[56]  Andrea Montanari,et al.  Information-theoretically optimal sparse PCA , 2014, 2014 IEEE International Symposium on Information Theory.

[57]  Yudong Chen,et al.  Statistical-Computational Tradeoffs in Planted Problems and Submatrix Localization with a Growing Number of Clusters and Submatrices , 2014, J. Mach. Learn. Res..

[58]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[59]  Martin J. Wainwright,et al.  Lower bounds on the performance of polynomial-time algorithms for sparse linear regression , 2014, COLT.

[60]  Richard M. Karp,et al.  Algorithms for graph partitioning on the planted partition model , 2001, Random Struct. Algorithms.

[61]  P. Donnelly,et al.  Inference of population structure using multilocus genotype data. , 2000, Genetics.

[62]  Béla Bollobás,et al.  Max Cut for Random Graphs with a Planted Partition , 2004, Combinatorics, Probability and Computing.

[63]  Adam Krzyzak,et al.  A Distribution-Free Theory of Nonparametric Regression , 2002, Springer series in statistics.

[64]  Emmanuel Abbe,et al.  Recovering Communities in the General Stochastic Block Model Without Knowing the Parameters , 2015, NIPS.

[65]  P. Bickel,et al.  SIMULTANEOUS ANALYSIS OF LASSO AND DANTZIG SELECTOR , 2008, 0801.1095.

[66]  Michael I. Jordan,et al.  Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators , 2015, 1503.03188.

[67]  Festschrift: In honor of Lee Lusted. , 1991, Medical decision making : an international journal of the Society for Medical Decision Making.

[68]  Andrea J. Goldsmith,et al.  Information Recovery From Pairwise Measurements , 2015, IEEE Transactions on Information Theory.

[69]  I. Jolliffe Principal Component Analysis , 2002 .

[70]  Yuchung J. Wang,et al.  Stochastic Blockmodels for Directed Graphs , 1987 .

[71]  Martin J. Wainwright,et al.  Minimax Rates of Estimation for High-Dimensional Linear Regression Over $\ell_q$ -Balls , 2009, IEEE Transactions on Information Theory.

[72]  Philippe Rigollet,et al.  Computational Lower Bounds for Sparse PCA , 2013, ArXiv.

[73]  C. J. Stone,et al.  Additive Regression and Other Nonparametric Models , 1985 .

[74]  Lucien Birgé Approximation dans les espaces métriques et théorie de l'estimation , 1983 .

[75]  Andrea Montanari,et al.  Finding Hidden Cliques of Size $$\sqrt{N/e}$$N/e in Nearly Linear Time , 2013, Found. Comput. Math..

[76]  Imre Csisz'ar,et al.  Consistent estimation of the basic neighborhood of Markov random fields , 2006, math/0605323.

[77]  Ravi B. Boppana,et al.  Eigenvalues and graph bisection: An average-case analysis , 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).

[78]  Andrea Montanari,et al.  Finding Hidden Cliques of Size \sqrt{N/e} in Nearly Linear Time , 2013, ArXiv.

[79]  Martin J. Wainwright,et al.  Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming , 2010, J. Mach. Learn. Res..

[80]  D. Donoho For most large underdetermined systems of linear equations the minimal 𝓁1‐norm solution is also the sparsest solution , 2006 .

[81]  Vahid Tarokh,et al.  Shannon-Theoretic Limits on Noisy Compressive Sampling , 2007, IEEE Transactions on Information Theory.

[82]  Florent Krzakala,et al.  Spectral detection in the censored block model , 2015, 2015 IEEE International Symposium on Information Theory (ISIT).

[83]  D. Donoho,et al.  Geometrizing Rates of Convergence , II , 2008 .

[84]  P. Eggermont,et al.  Maximum penalized likelihood estimation , 2001 .

[85]  E. Ising Beitrag zur Theorie des Ferromagnetismus , 1925 .

[86]  Martin E. Dyer,et al.  The Solution of Some Random NP-Hard Problems in Polynomial Expected Time , 1989, J. Algorithms.

[87]  Aditya Guntuboyina Lower Bounds for the Minimax Risk Using $f$-Divergences, and Applications , 2011, IEEE Transactions on Information Theory.

[88]  Martin J. Wainwright,et al.  Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting , 2009, IEEE Trans. Inf. Theory.

[89]  T. Snijders,et al.  Estimation and Prediction for Stochastic Blockmodels for Graphs with Latent Block Structure , 1997 .

[90]  Frank Thomson Leighton,et al.  Graph bisection algorithms with good average case behavior , 1984, Comb..

[91]  M. M. Meyer,et al.  Statistical Analysis of Multiple Sociometric Relations. , 1985 .

[92]  Michael I. Jordan,et al.  Graphical Models, Exponential Families, and Variational Inference , 2008, Found. Trends Mach. Learn..

[93]  Martin S. Kochmanski NOTE ON THE E. ISING'S PAPER ,,BEITRAG ZUR THEORIE DES FERROMAGNETISMUS" (Zs. Physik, 31, 253 (1925)) , 2008 .

[94]  Cynthia Rudin,et al.  Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society , 2014 .

[95]  Peter Bühlmann,et al.  Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm , 2007, J. Mach. Learn. Res..

[96]  Jingchun Chen,et al.  Detecting functional modules in the yeast protein-protein interaction network , 2006, Bioinform..

[97]  R. Srikant,et al.  Jointly clustering rows and columns of binary matrices: algorithms and trade-offs , 2013, SIGMETRICS '14.

[98]  J. Lafferty,et al.  High-dimensional Ising model selection using ℓ1-regularized logistic regression , 2010, 1010.0311.

[99]  Yihong Wu,et al.  Computational Barriers in Minimax Submatrix Detection , 2013, ArXiv.

[100]  Sanjay Shakkottai,et al.  Greedy learning of Markov network structure , 2010, 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[101]  Milan Sonka,et al.  Image Processing, Analysis and Machine Vision , 1993, Springer US.

[102]  I. Ibragimov,et al.  On density estimation in the view of Kolmogorov's ideas in approximation theory , 1990 .

[103]  Bruce E. Hajek,et al.  Achieving exact cluster recovery threshold via semidefinite programming , 2015, 2015 IEEE International Symposium on Information Theory (ISIT).

[104]  Emmanuel Abbe,et al.  Exact Recovery in the Stochastic Block Model , 2014, IEEE Transactions on Information Theory.

[105]  Terence Tao,et al.  The Dantzig selector: Statistical estimation when P is much larger than n , 2005, math/0506081.

[106]  Edoardo M. Airoldi,et al.  Stochastic blockmodel approximation of a graphon: Theory and consistent estimation , 2013, NIPS.

[107]  Anup Rao,et al.  Stochastic Block Model and Community Detection in Sparse Graphs: A spectral algorithm with optimal rate of recovery , 2015, COLT.

[108]  Mark Newman,et al.  Networks: An Introduction , 2010 .

[109]  Avrim Blum,et al.  Correlation Clustering , 2004, Machine Learning.

[110]  Roman Vershynin,et al.  Community detection in sparse networks via Grothendieck’s inequality , 2014, Probability Theory and Related Fields.

[111]  Peng Zhao,et al.  On Model Selection Consistency of Lasso , 2006, J. Mach. Learn. Res..

[112]  S H Strogatz,et al.  Random graph models of social networks , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[113]  Emmanuel Abbe,et al.  Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms , 2015, ArXiv.

[114]  S. Boorman,et al.  Social Structure from Multiple Networks. I. Blockmodels of Roles and Positions , 1976, American Journal of Sociology.

[115]  Martin J. Wainwright,et al.  Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using $\ell _{1}$ -Constrained Quadratic Programming (Lasso) , 2009, IEEE Transactions on Information Theory.

[116]  Edoardo M. Airoldi,et al.  Mixed Membership Stochastic Blockmodels , 2007, NIPS.

[117]  Laurent Massoulié,et al.  Edge Label Inference in Generalized Stochastic Block Models: from Spectral Theory to Impossibility Results , 2014, COLT.

[118]  Politis,et al.  [Springer Series in Statistics] Subsampling || Subsampling for Stationary Time Series , 1999 .

[119]  A. Kolmogorov,et al.  Entropy and "-capacity of sets in func-tional spaces , 1961 .

[120]  David R. Karger,et al.  Learning Markov networks: maximum bounded tree-width graphs , 2001, SODA '01.

[121]  David M Blei,et al.  Efficient discovery of overlapping communities in massive networks , 2013, Proceedings of the National Academy of Sciences.

[122]  Amit Singer,et al.  Decoding Binary Node Labels from Censored Edge Measurements: Phase Transition and Efficient Recovery , 2014, IEEE Transactions on Network Science and Engineering.

[123]  L. Lecam Convergence of Estimates Under Dimensionality Restrictions , 1973 .

[124]  N. Meinshausen,et al.  High-dimensional graphs and variable selection with the Lasso , 2006, math/0608017.

[125]  Alexandre Proutière,et al.  Accurate Community Detection in the Stochastic Block Model via Spectral Algorithms , 2014, ArXiv.

[126]  C. N. Liu,et al.  Approximating discrete probability distributions with dependence trees , 1968, IEEE Trans. Inf. Theory.

[127]  P. Bickel,et al.  A nonparametric view of network models and Newman–Girvan and other modularities , 2009, Proceedings of the National Academy of Sciences.

[128]  Martin J. Wainwright,et al.  A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers , 2009, NIPS.

[129]  Andrea Montanari,et al.  Asymptotic Mutual Information for the Two-Groups Stochastic Block Model , 2015, ArXiv.

[130]  R. Tibshirani,et al.  Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[131]  K. Fernow New York , 1896, American Potato Journal.

[132]  Yuhong Yang,et al.  Information-theoretic determination of minimax rates of convergence , 1999 .

[133]  W. Härdle Nonparametric and Semiparametric Models , 2004 .

[134]  Elizaveta Levina,et al.  On semidefinite relaxations for the block model , 2014, ArXiv.

[135]  Michael I. Jordan,et al.  A Direct Formulation for Sparse Pca Using Semidefinite Programming , 2004, NIPS 2004.

[136]  I. Johnstone,et al.  On Consistency and Sparsity for Principal Components Analysis in High Dimensions , 2009, Journal of the American Statistical Association.

[137]  I. Johnstone On the distribution of the largest eigenvalue in principal components analysis , 2001 .

[138]  Santo Fortunato,et al.  Community detection in graphs , 2009, ArXiv.

[139]  Leonidas J. Guibas,et al.  Consistent Shape Maps via Semidefinite Programming , 2013, SGP '13.

[140]  Larry A. Wasserman,et al.  SpAM: Sparse Additive Models , 2007, NIPS.

[141]  R. Tibshirani,et al.  Linear Smoothers and Additive Models , 1989 .

[142]  S. Geer,et al.  On the conditions used to prove oracle results for the Lasso , 2009, 0910.0722.

[143]  P. Pakzad,et al.  Phase Transitions for Mutual Information , 2010, 2010 6th International Symposium on Turbo Codes & Iterative Information Processing.

[144]  Emmanuel J. Candès,et al.  Decoding by linear programming , 2005, IEEE Transactions on Information Theory.

[145]  Mark Jerrum,et al.  The Metropolis Algorithm for Graph Bisection , 1998, Discret. Appl. Math..