On Universal Features for High-Dimensional Learning and Inference

We consider the problem of identifying universal low-dimensional features from high-dimensional data for inference tasks in settings involving learning. For such problems, we introduce natural notions of universality and show a local equivalence among them. Our analysis is naturally expressed via information geometry and is both conceptually and computationally useful. The development reveals the complementary roles of the singular value decomposition, Hirschfeld-Gebelein-Rényi maximal correlation, the canonical correlation and principal component analyses of Hotelling and Pearson, Tishby's information bottleneck, Wyner's common information, Ky Fan $k$-norms, and Breiman and Friedman's alternating conditional expectations algorithm. We further illustrate how this framework facilitates understanding and optimizing aspects of learning systems, including multinomial logistic (softmax) regression and the associated neural network architecture, matrix factorization methods for collaborative filtering and other applications, rank-constrained multivariate linear regression, and forms of semi-supervised learning.
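To make the link between maximal correlation and alternating conditional expectations concrete, the following is a minimal sketch for a finite joint distribution. It is an illustration under stated assumptions, not the authors' implementation: the function name `ace_maximal_correlation`, the iteration count, and the initial feature are hypothetical choices, and the joint pmf is assumed to have strictly positive marginals. The alternation repeatedly replaces each feature by the standardized conditional expectation of the other, which amounts to a power iteration and typically converges to the maximally correlated pair.

```python
import math


def ace_maximal_correlation(pxy, iters=200):
    """Estimate the HGR maximal correlation of a finite joint pmf.

    pxy[x][y] is the joint probability of (X=x, Y=y); both marginals are
    assumed strictly positive. Returns (rho, f, g), where f and g are
    zero-mean, unit-variance features and rho = E[f(X) g(Y)].
    """
    nx, ny = len(pxy), len(pxy[0])
    px = [sum(row) for row in pxy]
    py = [sum(pxy[x][y] for x in range(nx)) for y in range(ny)]

    def standardize(h, p):
        # Center and scale h to zero mean and unit variance under p.
        mean = sum(pi * hi for pi, hi in zip(p, h))
        centered = [hi - mean for hi in h]
        std = math.sqrt(sum(pi * ci * ci for pi, ci in zip(p, centered)))
        if std == 0.0:
            return centered  # degenerate case: feature is constant
        return [ci / std for ci in centered]

    # Hypothetical initialization: any non-constant function of x.
    f = standardize([float(x) for x in range(nx)], px)
    g = [0.0] * ny
    for _ in range(iters):
        # g(y) <- standardized E[f(X) | Y = y]
        g = standardize(
            [sum(pxy[x][y] * f[x] for x in range(nx)) / py[y] for y in range(ny)],
            py,
        )
        # f(x) <- standardized E[g(Y) | X = x]
        f = standardize(
            [sum(pxy[x][y] * g[y] for y in range(ny)) / px[x] for x in range(nx)],
            px,
        )
    rho = sum(pxy[x][y] * f[x] * g[y] for x in range(nx) for y in range(ny))
    return rho, f, g


# Doubly symmetric binary source with crossover probability 0.1.
dsbs = [[0.45, 0.05], [0.05, 0.45]]
rho_dsbs, f_dsbs, g_dsbs = ace_maximal_correlation(dsbs)

# Independent uniform bits: no non-trivial correlated features exist.
indep = [[0.25, 0.25], [0.25, 0.25]]
rho_indep, _, _ = ace_maximal_correlation(indep)
```

For the binary example, the estimate should agree with the known value $|1 - 2p|$ for crossover probability $p$. If the initial feature happens to be orthogonal to the dominant feature, the iteration can settle on a smaller correlation, so a different starting point may be needed in such cases.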
