Semi-Supervised Linear Regression

We study a regression problem in which both the label variable ($Y$) and the predictors (${\bf X}$) are observed for part of the data, while only the predictors are observed for the rest. Such a problem arises, for example, when observations of the label variable are costly and may require a skilled human agent. If the conditional expectation $E[Y | {\bf X}]$ is exactly linear in ${\bf X}$, then the additional observations of ${\bf X}$ typically contain no useful information; otherwise, the unlabeled data can be informative. In the latter case, our aim is to construct the best linear predictor. We suggest improved alternatives to the standard estimates, which depend only on the labeled data. Our estimation method is easy to implement and has asymptotic properties that are simple to describe. The new estimates asymptotically dominate the standard procedures under a certain non-linearity condition on $E[Y | {\bf X}]$; otherwise, the two are asymptotically equivalent. The performance of the new estimator for small sample sizes is investigated in an extensive simulation study. A real data example, inferring the homeless population, illustrates the new methodology.
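To see why the marginal distribution of ${\bf X}$ matters only when $E[Y | {\bf X}]$ is non-linear, consider the following minimal simulation sketch. It is not the estimator proposed here; it only fits ordinary least squares to labeled data. The function names `ols_fit` and `simulate_labeled`, the uniform design, and the choices $E[Y | X] = 2X$ and $E[Y | X] = X^2$ are illustrative assumptions.

```python
import numpy as np


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0]), float(coef[1])


def simulate_labeled(n, upper, rng, linear=False):
    """Draw X ~ Uniform(0, upper) and Y with E[Y|X] = 2X (linear) or X^2.

    Hypothetical data-generating process, chosen only for illustration.
    """
    x = rng.uniform(0.0, upper, size=n)
    mean = 2.0 * x if linear else x**2
    y = mean + rng.normal(0.0, 0.1, size=n)
    return x, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for linear in (True, False):
        for upper in (1.0, 2.0):
            x, y = simulate_labeled(200_000, upper, rng, linear=linear)
            intercept, slope = ols_fit(x, y)
            kind = "linear" if linear else "quadratic"
            print(f"E[Y|X] {kind}, X ~ U(0, {upper}): "
                  f"intercept={intercept:.3f}, slope={slope:.3f}")
```

In the linear case the fitted slope stays near 2 for both designs. In the quadratic case the population best linear slope is $\mathrm{Cov}(X, X^2)/\mathrm{Var}(X)$, which equals 1 for $X \sim U(0,1)$ and 2 for $X \sim U(0,2)$. The target therefore depends on the distribution of $X$, which is the quantity that unlabeled observations help to estimate.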
