Input-dependent estimation of generalization error under covariate shift

A common assumption in supervised learning is that the training and test input points follow the same probability distribution. However, this assumption is violated, for example, in interpolation, extrapolation, active learning, and classification with imbalanced data. The violation of this assumption, known as covariate shift, heavily biases standard generalization error estimation schemes such as cross-validation and Akaike's information criterion, and thus leads to poor model selection. In this paper, we propose an alternative estimator of the generalization error under the squared loss when the training and test distributions differ. The proposed estimator is shown to be exactly unbiased for finite samples if the learning target function is realizable, and asymptotically unbiased in general. Beyond unbiasedness, the proposed estimator also accurately estimates the difference in generalization error between candidate models, a desirable property for model selection. Numerical studies show that the proposed method compares favorably with existing model selection methods in regression for extrapolation and in classification with imbalanced data.
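To make the covariate-shift setting concrete, the following is a minimal sketch of the importance-weighting idea that underlies generalization error estimation when training and test input distributions differ. It is not the paper's proposed estimator; the Gaussian training/test densities, the sinc target, the linear model, and the assumption that both densities are known are all illustrative choices. The sketch shows that the plain training error is biased for the test-domain error, while reweighting each training loss by the density ratio w(x) = p_test(x) / p_train(x) removes that bias.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Illustrative learning target (an assumption, not from the paper)
    return np.sinc(x)

# Covariate shift: training inputs ~ N(0, 1), test inputs ~ N(1, 0.5^2)
mu_tr, sd_tr = 0.0, 1.0
mu_te, sd_te = 1.0, 0.5
n = 200
x_tr = rng.normal(mu_tr, sd_tr, n)
y_tr = f(x_tr) + rng.normal(0.0, 0.1, n)

# Fit a (misspecified) linear model y = a*x + b by ordinary least squares
X = np.column_stack([x_tr, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y_tr, rcond=None)

def predict(x):
    return coef[0] * x + coef[1]

def gaussian_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

loss_tr = (predict(x_tr) - y_tr) ** 2

# Unweighted training error: biased for the test-domain error under shift
err_plain = loss_tr.mean()

# Importance-weighted error: E_train[w(x) * loss] = E_test[loss]
w = gaussian_pdf(x_tr, mu_te, sd_te) / gaussian_pdf(x_tr, mu_tr, sd_tr)
err_weighted = (w * loss_tr).mean()

# Monte Carlo ground truth on a large test sample
x_te = rng.normal(mu_te, sd_te, 100_000)
y_te = f(x_te) + rng.normal(0.0, 0.1, 100_000)
err_true = ((predict(x_te) - y_te) ** 2).mean()

print(f"plain training error:      {err_plain:.4f}")
print(f"importance-weighted error: {err_weighted:.4f}")
print(f"true test error (MC):      {err_true:.4f}")
```

In this toy setup the weighted estimate tracks the Monte Carlo test error while the plain training error understates it, since the model fits well near the training mode but is evaluated on shifted inputs. The paper's estimator goes further than this generic reweighting: it is input-dependent and exactly unbiased for finite samples in the realizable case.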
