Scalable Interpretable Multi-Response Regression via SEED

Sparse reduced-rank regression is an important tool for uncovering meaningful dependence structure between large numbers of predictors and responses in many big data applications such as genome-wide association studies and social media analysis. Despite recent theoretical and algorithmic advances, scalable estimation of sparse reduced-rank regression remains largely unexplored. In this paper, we suggest a scalable procedure called sequential estimation with eigen-decomposition (SEED), which requires only a single top-$r$ singular value decomposition to find the optimal low-rank and sparse matrix by solving a sparse generalized eigenvalue problem. The suggested method is not only scalable but also performs simultaneous dimensionality reduction and variable selection. Under mild regularity conditions, we show that SEED enjoys desirable sampling properties, including consistency in estimation, rank selection, prediction, and model selection. Numerical studies on synthetic and real data sets show that SEED outperforms state-of-the-art approaches on large-scale matrix estimation problems.

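To make the computational idea in the abstract concrete, below is a minimal illustrative sketch (not the authors' implementation) of a truncated power iteration for a sparse eigenvalue problem of the kind SEED builds on. It assumes the multi-response model $Y \approx XC + E$ and, for simplicity, works with the surrogate matrix $X^\top Y Y^\top X$ under a plain unit-norm constraint rather than the generalized constraint involving $X^\top X$; the function name `truncated_power_sparse_eig`, the sparsity level `k`, and the toy dimensions are hypothetical choices for illustration only.

```python
import numpy as np

def truncated_power_sparse_eig(A, k, n_iter=200, tol=1e-8):
    """Truncated power iteration (a sketch) for the sparse eigenvalue problem
    max_u u^T A u  subject to  ||u||_2 = 1, ||u||_0 <= k,
    where A is symmetric, e.g. A = X^T Y Y^T X in a multi-response model."""
    p = A.shape[0]
    u = np.zeros(p)
    # initialize on the coordinate with the largest diagonal entry of A
    u[np.argmax(np.diag(A))] = 1.0
    for _ in range(n_iter):
        v = A @ u
        # keep only the k largest-magnitude coordinates, zero out the rest
        v[np.argsort(np.abs(v))[:-k]] = 0.0
        norm = np.linalg.norm(v)
        if norm == 0:
            break
        v /= norm
        if np.linalg.norm(v - u) < tol:
            u = v
            break
        u = v
    return u

# Toy usage: a sparse rank-one coefficient matrix C = u_true v_true^T
rng = np.random.default_rng(0)
n, p, q = 200, 50, 30
u_true = np.zeros(p); u_true[:5] = 1.0 / np.sqrt(5.0)
v_true = rng.standard_normal(q)
X = rng.standard_normal((n, p))
Y = X @ np.outer(u_true, v_true) + 0.1 * rng.standard_normal((n, q))
A = X.T @ Y @ Y.T @ X            # surrogate whose leading eigenvector is
u_hat = truncated_power_sparse_eig(A, k=5)   # close to the sparse direction
print(np.nonzero(u_hat)[0])      # should concentrate on the first 5 coordinates
```

In the actual SEED procedure, such sparse eigen-directions are extracted sequentially from a single top-$r$ decomposition to build the low-rank, sparse coefficient estimate; the sketch above only illustrates the single-direction sparse eigenvalue step under the stated simplifications.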