Learning under differing training and test distributions

Abstract

One of the main problems in machine learning is to train a predictive model from training data and to make predictions on test data. Most predictive models are constructed under the assumption that the training data is governed by the exact same distribution to which the model will later be exposed. In practice, control over the data collection process is often imperfect. A typical scenario is when labels are collected by questionnaires and one does not have access to the test population. For example, parts of the test population are underrepresented in the survey, out of reach, or do not return the questionnaire. In many applications, training data from the test distribution are scarce because they are difficult to obtain or very expensive, while data from auxiliary sources drawn from similar distributions are often cheaply available.

This thesis centers on learning under differing training and test distributions and covers several problem settings with different assumptions on the relationship between training and test distributions, including multi-task learning and learning under covariate shift and sample selection bias. Several new models are derived that directly characterize the divergence between training and test distributions, without the intermediate step of estimating the training and test distributions separately. Integral to these models are rescaling weights that match the rescaled or resampled training distribution to the test distribution. Integrated models are studied in which only a single optimization problem needs to be solved for learning under differing distributions. With a two-step approximation to the integrated models, almost any supervised learning algorithm can be adapted to biased training data.

In case studies on spam filtering, HIV therapy screening, targeted advertising, and other applications, the performance of the new models is compared to state-of-the-art reference methods.
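The rescaling-weight idea can be made concrete with a small sketch. Under the covariate-shift setting, the ideal weight for a training example x is w(x) = p_test(x) / p_train(x), and the two-step approximation mentioned above first estimates these weights and then hands them to an ordinary learner that accepts example weights. The Python sketch below is an illustrative assumption, not the thesis's implementation: it estimates the weights discriminatively, without modelling the two densities separately, by training a probabilistic classifier that separates test from training examples, and the names (estimate_rescaling_weights, X_train, X_test) and the use of scikit-learn are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def estimate_rescaling_weights(X_train, X_test):
        """Step 1: estimate w(x) ~ p_test(x) / p_train(x) discriminatively.

        A probabilistic classifier is trained to separate test examples
        (label 1) from training examples (label 0); its odds on a training
        example are proportional to the density ratio.
        """
        X = np.vstack([X_train, X_test])
        s = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
        selector = LogisticRegression(max_iter=1000).fit(X, s)
        p_test_given_x = selector.predict_proba(X_train)[:, 1]
        # Odds ratio; the factor n_train / n_test only rescales all weights uniformly.
        weights = p_test_given_x / np.clip(1.0 - p_test_given_x, 1e-12, None)
        return weights * len(X_train) / len(X_test)

    def train_under_covariate_shift(X_train, y_train, X_test):
        """Step 2: plug the weights into any learner that supports example weights."""
        w = estimate_rescaling_weights(X_train, X_test)
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, y_train, sample_weight=w)
        return model

In this two-step form the weight estimator and the final classifier are trained separately; the integrated models described in the abstract instead fold both into a single optimization problem.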
