Algorithms for subset selection in linear regression

We study the problem of selecting a subset of k random variables to observe that will yield the best linear prediction of another variable of interest, given the pairwise correlations between the observation variables and the variable to be predicted. Under approximation-preserving reductions, this problem is equivalent to the "sparse approximation" problem of approximating signals concisely. The subset selection problem is NP-hard in general; in this paper, we propose and analyze exact and approximation algorithms for several special cases of practical interest. Specifically, we give an FPTAS when the covariance matrix has constant bandwidth, and exact algorithms when the associated covariance graph, consisting of edges for pairs of variables with non-zero correlation, forms a tree or has a large (known) independent set. Furthermore, we give an exact algorithm when the variables can be embedded into a line such that the covariance decreases exponentially in the distance, and a constant-factor approximation when the variables have no "conditional suppressor variables". Much of our reasoning is based on perturbation results for the R² multiple correlation measure, which is frequently used as a natural measure of goodness of fit in statistics. It lies at the core of our FPTAS, and also allows us to extend our exact algorithms to approximation algorithms when the matrix "nearly" falls into one of the above classes. We also use our perturbation analysis to prove approximation guarantees for the widely used "Forward Regression" heuristic under the assumption that the observation variables are nearly independent.
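As a concrete illustration (a sketch, not code from the paper), the Forward Regression heuristic mentioned above greedily adds, at each step, the remaining variable that most increases R². With the target variable Z normalized to unit variance, the squared multiple correlation of the best linear predictor from a subset S is b_S^T C_S^{-1} b_S, where C is the covariance matrix of the observation variables and b their covariances with Z. A minimal Python version, with all function and variable names chosen here for illustration:

```python
import numpy as np

def r_squared(C, b, S):
    """R^2 of predicting Z from subset S, given the covariance matrix C
    of the observation variables and the vector b of their covariances
    with Z (Z assumed to have unit variance)."""
    if not S:
        return 0.0
    idx = list(S)
    C_S = C[np.ix_(idx, idx)]
    b_S = b[idx]
    # Best linear predictor achieves R^2 = b_S^T C_S^{-1} b_S.
    return float(b_S @ np.linalg.solve(C_S, b_S))

def forward_regression(C, b, k):
    """Greedily select k variables, each round adding the variable
    whose inclusion most increases R^2."""
    n = len(b)
    S = []
    for _ in range(k):
        current = r_squared(C, b, S)
        best_gain, best_i = -1.0, None
        for i in range(n):
            if i in S:
                continue
            gain = r_squared(C, b, S + [i]) - current
            if gain > best_gain:
                best_gain, best_i = gain, i
        S.append(best_i)
    return S
```

When the observation variables are exactly independent (C diagonal), the greedy choice is optimal; the paper's perturbation analysis is what extends guarantees to the nearly independent case.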
