Covariate Shift Adaptation by Importance Weighted Cross Validation

A common assumption in supervised learning is that the input points in the training set follow the same probability distribution as the input points that will be given in the future test phase. However, this assumption is not satisfied, for example, when predictions must be extrapolated outside the training region. The situation where the training input points and test input points follow different distributions, while the conditional distribution of output values given input points is unchanged, is called covariate shift. Under covariate shift, standard model selection techniques such as cross validation do not work as desired, since their unbiasedness is no longer maintained. In this paper, we propose a new method called importance weighted cross validation (IWCV), which we prove to be unbiased even under covariate shift. The IWCV procedure is the only one that can be applied to unbiased classification under covariate shift, whereas alternatives to IWCV exist for regression. The usefulness of the proposed method is illustrated by simulations, and further demonstrated in a brain-computer interface, where strong non-stationarity effects can be seen between training and test sessions.
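
As a rough illustration of the weighting idea, and in generic notation assumed here rather than taken from the paper, an importance-weighted leave-one-out estimate of the generalization error can be sketched as

\[
\widehat{G}_{\mathrm{IWCV}} \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{p_{\mathrm{test}}(x_i)}{p_{\mathrm{train}}(x_i)} \,\ell\bigl(\hat{f}^{(-i)}(x_i),\, y_i\bigr),
\]

where \(\hat{f}^{(-i)}\) denotes the model trained with the i-th sample held out, \(\ell\) is the loss function, and the density ratio \(p_{\mathrm{test}}(x)/p_{\mathrm{train}}(x)\) serves as the importance weight. Because the conditional distribution of outputs given inputs is assumed unchanged, reweighting each held-out loss by this ratio compensates for the mismatch between training and test input distributions, which is what restores the unbiasedness that ordinary cross validation loses under covariate shift.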
