Probabilistic Joint Feature Selection for Multi-task Learning

We study the joint feature selection problem that arises when learning multiple related classification or regression tasks. By imposing an automatic relevance determination (ARD) prior on the hypothesis class associated with each task and regularizing the variance of the hypothesis parameters, the framework encourages similar feature patterns across tasks and identifies features that are relevant to all (or most) of them. Our analysis shows that the proposed probabilistic framework can be seen as a generalization of a previous result on adaptive ridge regression to the multi-task learning setting. We give a detailed description of the proposed algorithms for simultaneous model construction and justify them from several perspectives. Experimental results show that this approach outperforms both a regularized multi-task learning approach and traditional methods that solve each task independently, on synthetic data and on real-world data sets for lung cancer prognosis.
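To make the construction concrete, the following is a minimal sketch of the kind of joint objective such a framework induces; the notation (task weight vectors $w_t$, shared per-feature ARD precisions $\alpha_j$, cross-task mean weights $\bar{w}_j$, and variance-penalty strength $\lambda$) is illustrative and not taken from the paper:

\[
\min_{\{w_t\},\ \alpha \ge 0}\ \sum_{t=1}^{T}\sum_{i=1}^{n_t} \ell\big(y_{ti},\, w_t^{\top} x_{ti}\big)
\;+\; \sum_{j=1}^{d} \alpha_j \sum_{t=1}^{T} w_{tj}^{2}
\;+\; \lambda \sum_{j=1}^{d}\sum_{t=1}^{T} \big(w_{tj}-\bar{w}_j\big)^{2},
\qquad \bar{w}_j = \frac{1}{T}\sum_{t=1}^{T} w_{tj}.
\]

Because a single precision $\alpha_j$ is shared by feature $j$ across all $T$ tasks, inflating $\alpha_j$ shrinks that feature out of every model simultaneously, which is what makes the selection joint, while the $\lambda$ term penalizes the cross-task variance of each feature's weights. For $T = 1$ the variance term vanishes and the middle term reduces to an adaptive ridge penalty, consistent with the claim that the framework generalizes adaptive ridge regression to the multi-task setting.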
