Random matrices in service of ML footprint: ternary random features with no performance loss

In this article, we investigate the spectral behavior of random features kernel matrices of the type $\mathbf{K} = \mathbb{E}_{\mathbf{w}}\big[\sigma(\mathbf{w}^\top \mathbf{x}_i)\,\sigma(\mathbf{w}^\top \mathbf{x}_j)\big]_{i,j=1}^{n}$, with nonlinear function $\sigma(\cdot)$, data $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^p$, and random projection vector $\mathbf{w} \in \mathbb{R}^p$ having i.i.d. entries. In a high-dimensional setting where the number of data points $n$ and their dimension $p$ are both large and comparable, we show, under a Gaussian mixture model for the data, that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. (zero-mean and unit-variance) entries of $\mathbf{w}$, and depends on $\sigma(\cdot)$ only through its (generalized) Gaussian moments $\mathbb{E}_{z\sim\mathcal{N}(0,1)}[\sigma'(z)]$ and $\mathbb{E}_{z\sim\mathcal{N}(0,1)}[\sigma''(z)]$. As a result, for any kernel matrix $\mathbf{K}$ of the form above, we propose a novel random features technique, called Ternary Random Features (TRF), that (i) asymptotically yields the same limiting kernel as the original $\mathbf{K}$ in a spectral sense and (ii) can be computed and stored much more efficiently, by wisely tuning (in a data-dependent manner) the function $\sigma$ and the random vector $\mathbf{w}$, both taking values in $\{-1, 0, 1\}$. The computation of the proposed random features requires no multiplication, and a factor of $b$ times fewer bits for storage compared to classical random features such as random Fourier features, with $b$ the number of bits needed to store a full-precision value. Moreover, our experiments on real data show that these substantial gains in computation and storage are accompanied by slightly improved performance compared to state-of-the-art random features compression/quantization methods.
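To make the construction concrete, below is a minimal NumPy sketch of the ternary pipeline described above: a projection matrix with i.i.d. entries in $\{-1, 0, 1\}$ followed by a thresholded-sign (ternary) activation, so that the feature map is multiplication-free and each feature carries at most two bits. The function name, the sparsity and threshold parameters, and the toy data scaling are illustrative placeholders; in the paper these quantities are tuned in a data-dependent manner so that the limiting kernel matches that of the original $\sigma$.

```python
import numpy as np


def ternary_random_features(X, n_features=512, sparsity=0.5, threshold=0.7, seed=0):
    """Minimal sketch of ternary random features (TRF).

    X          : (n, p) data matrix.
    n_features : number of random features N.
    sparsity   : probability that an entry of w equals 0 (hypothetical
                 placeholder; the paper tunes this in a data-dependent way).
    threshold  : cut-off s of the ternary activation
                 sigma(t) = -1 if t < -s, 0 if |t| <= s, +1 if t > s
                 (also a placeholder for the paper's data-dependent tuning).
    """
    n, p = X.shape
    rng = np.random.default_rng(seed)

    # Ternary projection matrix W in {-1, 0, +1}^{N x p}: zero with
    # probability `sparsity`, otherwise +1 or -1 with equal probability.
    W = rng.choice(
        [-1, 0, 1],
        size=(n_features, p),
        p=[(1 - sparsity) / 2, sparsity, (1 - sparsity) / 2],
    )

    # The projections only involve additions/subtractions of entries of X,
    # since W has entries in {-1, 0, 1} (no multiplications needed).
    T = X @ W.T  # shape (n, N)

    # Ternary activation: the features themselves live in {-1, 0, +1}.
    S = np.sign(T) * (np.abs(T) > threshold)
    return S.astype(np.int8)


if __name__ == "__main__":
    # Usage: the empirical Gram matrix S S^T / N approximates the limiting
    # kernel that suitably tuned classical random features would produce.
    X = np.random.randn(200, 64) / np.sqrt(64)  # toy data, n = 200, p = 64
    S = ternary_random_features(X, n_features=1024)
    K_approx = (S.astype(np.float64) @ S.T.astype(np.float64)) / S.shape[1]
    print(K_approx.shape)  # (200, 200)
```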
