Transfer Learning Using Feature Selection

We present three related methods that use transfer learning to improve feature selection. The three methods address different problems, and hence share different kinds of information between tasks or feature classes, but all are based on the information-theoretic Minimum Description Length (MDL) principle and share the same underlying Bayesian interpretation. The first method, MIC (Multiple Inclusion Criterion), applies when predictive models are built simultaneously for multiple tasks (``simultaneous transfer'') that share the same set of features. MIC allows each feature to be added to none, some, or all of the task models, and is most beneficial for selecting a small set of predictive features from a large pool, as is common in genomic and biological datasets. Our second method, TPC (Three-Part Coding), uses a similar methodology for the case in which the features can be divided into feature classes. Our third method, Transfer-TPC, addresses the ``sequential transfer'' problem, in which the task to which we want to transfer knowledge may not be known in advance and may have a different amount of data than the other tasks. Transfer-TPC is most beneficial for transferring knowledge between tasks with unequal amounts of labeled data, for example, data for disambiguating the senses of different verbs. We demonstrate the effectiveness of these approaches with experimental results on real-world data from genomics and from Word Sense Disambiguation (WSD).
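To make the shared-coding idea concrete, below is a minimal Python sketch of a greedy, MIC-style stepwise search: at each step a feature is added to the subset of task models where the bits it saves on the residuals exceed the bits needed to describe it. The particular coding costs (log2 p bits to name the feature, log2 C(m, k) bits to name the k tasks that receive it, and a flat coef_bits per nonzero coefficient) and all function names are illustrative assumptions, not the exact penalties used by MIC or TPC.

```python
import numpy as np
from math import comb, log2

def data_bits(y, X_sel):
    # Bits to code the residuals of an OLS fit of y on the selected
    # columns, using the standard (n / 2) * log2(RSS / n) approximation.
    n = len(y)
    A = np.column_stack([np.ones(n), X_sel]) if X_sel.shape[1] else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return 0.5 * n * log2(rss / n + 1e-12)

def mic_style_select(X, Y, coef_bits=2.0):
    # Greedy search: repeatedly add the (feature, task-subset) pair whose
    # savings in residual bits most exceed the bits spent describing it.
    n, p = X.shape
    m = Y.shape[1]
    chosen = [set() for _ in range(m)]                 # selected features per task
    cur = [data_bits(Y[:, t], X[:, []]) for t in range(m)]
    while True:
        best = None                                    # (net savings, feature, picks)
        for j in range(p):
            gains = []
            for t in range(m):
                if j in chosen[t]:
                    continue
                nb = data_bits(Y[:, t], X[:, sorted(chosen[t] | {j})])
                gains.append((cur[t] - nb, t, nb))
            gains.sort(key=lambda g: -g[0])            # most-helped tasks first
            run = 0.0
            for k in range(1, len(gains) + 1):
                run += gains[k - 1][0]
                # Model cost: name the feature, name the k receiving tasks,
                # then pay a flat rate per nonzero coefficient.
                net = run - (log2(p) + log2(comb(m, k)) + k * coef_bits)
                if best is None or net > best[0]:
                    best = (net, j, gains[:k])
        if best is None or best[0] <= 0:
            return chosen                              # no move pays for itself
        _, j, picks = best
        for _, t, nb in picks:
            chosen[t].add(j)
            cur[t] = nb

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 30))
    # Feature 3 drives both tasks; feature 7 helps only task 0.
    Y = np.column_stack([X[:, 3] + 0.5 * X[:, 7], -X[:, 3]])
    Y = Y + 0.1 * rng.standard_normal(Y.shape)
    print(mic_style_select(X, Y))                      # e.g. [{3, 7}, {3}]
```

Coding the task subset explicitly is what lets a feature that helps many tasks enter all of their models almost as cheaply as one; roughly speaking, a TPC-style variant would instead charge for naming the feature's class, so that features from already-used classes become cheaper to add.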
