Semi-supervised inference: General theory and estimation of means

We propose a general semi-supervised inference framework focused on the estimation of the population mean. As usual in semi-supervised settings, there is an unlabeled sample of covariate vectors and a labeled sample consisting of covariate vectors along with real-valued responses ("labels"). Otherwise, the formulation is "assumption-lean" in that no major conditions are imposed on the statistical or functional form of the data. We consider both the ideal semi-supervised setting, in which infinitely many unlabeled samples are available, and the ordinary semi-supervised setting, in which only a finite number of unlabeled samples is available. Estimators are proposed along with corresponding confidence intervals for the population mean. Theoretical analyses of both the asymptotic distribution and the $\ell_2$-risk of the proposed procedures are given. Surprisingly, the proposed estimators, based on a simple form of the least squares method, outperform the ordinary sample mean. The simple, transparent form of the estimator lends confidence to the perception that its asymptotic improvement over the ordinary sample mean also nearly holds for moderate-sized samples. The method is further extended to a nonparametric setting, in which the oracle rate can be achieved asymptotically. The proposed estimators are illustrated by simulation studies and a real data example involving estimation of the homeless population.
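The abstract does not spell out the estimator, so the following is only a minimal sketch of one natural reading of "a simple form of the least squares method": fit an ordinary least squares regression on the labeled sample, then correct the labeled sample mean by the fitted slope times the gap between the labeled covariate mean and the covariate mean over all available covariates. The function name `semi_supervised_mean` and its argument names are hypothetical, and the exact form used in the paper (including its confidence intervals) may differ.

```python
import numpy as np


def semi_supervised_mean(x_labeled, y_labeled, x_unlabeled):
    """Regression-adjusted estimate of E[Y] (illustrative sketch, not the paper's code).

    x_labeled:   (n, p) covariates with observed responses
    y_labeled:   (n,)   responses for the labeled sample
    x_unlabeled: (m, p) covariates without responses (m may be 0)
    """
    x_labeled = np.asarray(x_labeled, dtype=float)
    y_labeled = np.asarray(y_labeled, dtype=float)
    x_unlabeled = np.asarray(x_unlabeled, dtype=float)

    n = x_labeled.shape[0]

    # Ordinary least squares with an intercept, fitted on the labeled data only.
    design = np.column_stack([np.ones(n), x_labeled])
    coef, *_ = np.linalg.lstsq(design, y_labeled, rcond=None)
    slope = coef[1:]

    # Covariate means: labeled sample alone, and labeled + unlabeled pooled.
    x_bar_labeled = x_labeled.mean(axis=0)
    x_bar_all = np.vstack([x_labeled, x_unlabeled]).mean(axis=0)

    # Adjust the labeled sample mean of Y by the covariate-mean discrepancy.
    return y_labeled.mean() - slope @ (x_bar_labeled - x_bar_all)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_lab = rng.normal(size=(50, 2))
    y_lab = 1.0 + x_lab @ np.array([2.0, -1.0]) + rng.normal(size=50)
    x_unl = rng.normal(size=(5000, 2))
    print("sample mean:          ", y_lab.mean())
    print("semi-supervised mean: ", semi_supervised_mean(x_lab, y_lab, x_unl))
```

With no unlabeled covariates the two covariate means coincide and the sketch reduces to the ordinary sample mean, which is consistent with the abstract's framing of the sample mean as the baseline.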
