Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

We propose MIC (Multiple Inclusion Criterion), a framework for learning sparse models based on the information-theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space through two-part MDL coding schemes. We present MIC-based models for the problems of grouped feature selection (MIC-GROUP) and multi-task feature selection (MIC-MULTI). MIC-GROUP assumes that the features are divided into groups and induces two-level sparsity: it selects a subset of the feature groups and also selects features within each selected group. MIC-MULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It likewise induces two-level sparsity, selecting a subset of the features and then selecting the tasks to which each feature should be added. Finally, we propose a model, TRANSFEAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed to select a small set of predictive features from a large pool of candidates. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.
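To make the two-part coding idea concrete, the following is a minimal sketch of MDL-penalized greedy feature selection for a single linear regression task. It is not the MIC implementation itself, and the specific bit costs are illustrative assumptions: roughly log2(p) bits to name each selected feature out of p candidates, (1/2) log2(n) bits per fitted coefficient, and a Gaussian negative log-likelihood of the residuals as the data cost.

```python
import numpy as np

def description_length(X, y, subset):
    """Two-part MDL score (in bits, up to constants) for a linear model.

    Part 1 (model cost): ~log2(p) bits to name each selected feature,
    plus ~0.5*log2(n) bits to encode each fitted coefficient.
    Part 2 (data cost): Gaussian negative log-likelihood of the residuals.
    These costs are illustrative choices, not the paper's exact coding scheme.
    """
    n, p = X.shape
    k = len(subset)
    if k == 0:
        rss = np.sum((y - y.mean()) ** 2)
    else:
        Xs = X[:, subset]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
    model_bits = k * np.log2(p) + 0.5 * k * np.log2(n)
    data_bits = 0.5 * n * np.log2(max(rss, 1e-12) / n)
    return model_bits + data_bits

def greedy_mdl_select(X, y):
    """Forward selection: repeatedly add the feature that most reduces
    the total description length; stop when no addition helps."""
    selected = []
    best = description_length(X, y, selected)
    improved = True
    while improved:
        improved = False
        for j in range(X.shape[1]):
            if j in selected:
                continue
            score = description_length(X, y, selected + [j])
            if score < best:
                best, best_j, improved = score, j, True
        if improved:
            selected.append(best_j)
    return selected
```

MIC-GROUP and MIC-MULTI extend this idea by making the model cost (Part 1) cheaper for features that share a group, or a task set, with features already selected, which is what induces the two-level sparsity described above.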
