Truncated Linear Regression in High Dimensions

As in standard linear regression, in truncated linear regression we are given access to observations $(A_i, y_i)_i$ whose dependent variable equals $y_i = A_i^{\rm T} \cdot x^* + \eta_i$, where $x^*$ is some fixed unknown vector of interest and $\eta_i$ is independent noise; except we are only given an observation if its dependent variable $y_i$ lies in some "truncation set" $S \subset \mathbb{R}$. The goal is to recover $x^*$ under some favorable conditions on the $A_i$'s and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $n$-dimensional vectors $x^*$ from $m$ truncated samples, which attains an optimal $\ell_2$ reconstruction error of $O(\sqrt{(k \log n)/m})$. As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. This generalizes the works of both: (1) [Daskalakis et al. 2018], where no regularization is needed because the data are low-dimensional, and (2) [Wainwright 2009], where the objective function is simple because there is no truncation. To handle truncation and high-dimensionality simultaneously, we develop new techniques that not only generalize the existing ones but are, we believe, of independent interest.
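A minimal sketch of the kind of procedure described above, assuming standard Gaussian noise and an interval truncation set $S = [\texttt{lower}, \texttt{upper}]$: per-sample proximal SGD on the $\ell_1$-regularized negative log-likelihood of the truncated Gaussian, with soft-thresholding playing the role of the LASSO penalty. The function name, step size, and regularization weight below are illustrative choices, not the paper's exact algorithm or constants.

```python
import math
import numpy as np

def phi(t):
    """Standard normal pdf."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def truncated_lasso_sgd(A, y, lower, upper, lam=0.05, lr=0.01, epochs=50, seed=0):
    """Proximal SGD on the l1-regularized truncated-Gaussian negative
    log-likelihood.  For a sample (a, y) with mu = a.x, the per-sample NLL is
    (y - mu)^2 / 2 + log Z(mu),  Z(mu) = Phi(upper - mu) - Phi(lower - mu),
    i.e. least squares plus a correction for conditioning on y landing in S."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):
            a, yi = A[i], y[i]
            mu = float(a @ x)
            Z = max(Phi(upper - mu) - Phi(lower - mu), 1e-12)
            # Gradient of the per-sample NLL in x: the usual residual term
            # plus d/dmu log Z(mu) = (phi(lower-mu) - phi(upper-mu)) / Z.
            g = (-(yi - mu) + (phi(lower - mu) - phi(upper - mu)) / Z) * a
            x -= lr * g
            # Proximal step for the l1 penalty (soft-thresholding).
            x = np.sign(x) * np.maximum(np.abs(x) - lr * lam, 0.0)
    return x
```

Note that with $S = \mathbb{R}$ the correction term vanishes ($Z = 1$, both pdf terms cancel) and the update reduces to ordinary SGD on the LASSO objective, matching the untruncated setting of [Wainwright 2009].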

[1] Richard G. Baraniuk, et al. A simple proof that random matrices are democratic, 2009, arXiv.

[2] Helmut Schneider. Truncated and Censored Samples from Normal Populations, 1986.

[3] Richard Breen, et al. Regression Models: Censored, Sample Selected, or Truncated Data, 1996.

[4] Jerry A. Hausman, et al. Social Experimentation, Truncated Distributions, and Efficient Estimation, 1977.

[5] T. Amemiya. Regression Analysis when the Dependent Variable Is Truncated Normal, 1973, Econometrica.

[6] P. Schmidt, et al. Limited-Dependent and Qualitative Variables in Econometrics, 1984.

[7] M. Rudelson, et al. Non-asymptotic theory of random matrices: extreme singular values, 2010, arXiv:1003.2990.

[8] Richard G. Baraniuk, et al. Democracy in Action: Quantization, Saturation, and Compressive Sensing, 2011.

[9] C. B. Morgan. Truncated and Censored Samples: Theory and Applications, 1993.

[10] Sylvain Chevillard, et al. The functions erf and erfc computed with arbitrary precision and explicit error bounds, 2009, Inf. Comput.

[11] Christos Tzamos, et al. Computationally and Statistically Efficient Truncated Regression, 2020, COLT.

[12] E. Candès, et al. Stable signal recovery from incomplete and inaccurate measurements, 2005, arXiv:math/0503066.

[13] D. McFadden, et al. The method of simulated scores for the estimation of LDV models, 1998.

[14] Stephen P. Boyd, et al. Convex Optimization, 2004, Algorithms and Theory of Computation Handbook.

[15] D. Donoho. For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution, 2006.

[16] J. Tobin. Estimation of Relationships for Limited Dependent Variables, 1958.

[17] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[18] Michael Keane, et al. Simulation estimation for panel data models with limited dependent variables, 1993.

[19] K. Pearson, et al. On the Generalised Probable Error in Multiple Normal Correlation, 1908.

[20] Narayanaswamy Balakrishnan, et al. The Art of Progressive Censoring, 2014.

[21] Francis Galton, et al. An examination into the registered speeds of American trotting horses, with remarks on their value as hereditary data, 1898, Proceedings of the Royal Society of London.