Learning a Product of Experts with Elitist Lasso

Discriminative models such as logistic regression benefit from the ability to incorporate arbitrary rich features; however, complex dependencies among overlapping features can often result in weight undertraining. One popular method that attempts to mitigate this problem is logarithmic opinion pools (LOP), a specialized form of product-of-experts model that automatically adjusts the weighting among experts. A major problem with LOP is that it requires significant amounts of domain expertise in designing effective experts. We propose a novel method that learns to induce experts, not just the weighting between them, through the use of a mixed ℓ2ℓ1 norm as previously seen in elitist lasso. Unlike its more popular sibling, the ℓ1ℓ2 norm used in group lasso, which seeks feature sparsity at the group level, the ℓ2ℓ1 norm encourages sparsity within feature groups. We demonstrate how this property can be leveraged as a competition mechanism to induce groups of diverse experts, and introduce a new formulation of elitist lasso MaxEnt in the FOBOS optimization framework (Duchi and Singer, 2009). Results on the Named Entity Recognition task suggest that this method gives consistent improvements over a standard logistic regression model, and is more effective than conventional induction schemes for experts.
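To make the within-group sparsity mechanism concrete, the sketch below shows one FOBOS iteration with one common form of the elitist lasso penalty, where each feature group g is charged (lam/2)(Σ_{j∈g} |w_j|)², a squared ℓ1 norm inside the group summed across groups. This is a minimal illustrative sketch of that penalty's closed-form proximal step (group-wise soft-thresholding with a data-dependent threshold), not the paper's implementation, and it assumes the squared-ℓ1-per-group formulation from Kowalski and Torrésani (2009); the function and variable names (elitist_prox, fobos_step, groups, eta, lam) are hypothetical.

    import numpy as np

    def elitist_prox(v, lam):
        # Proximal operator of (lam / 2) * (sum_j |v_j|)^2 for one feature group:
        # soft-thresholding with a threshold that depends on the group itself.
        a = np.sort(np.abs(v))[::-1]              # magnitudes, largest first
        cums = np.cumsum(a)
        m = np.arange(1, a.size + 1)
        keep = a > lam * cums / (1.0 + lam * m)   # candidate support sizes
        if not keep.any():
            return np.zeros_like(v)
        M = np.nonzero(keep)[0][-1] + 1           # number of surviving features
        tau = lam * cums[M - 1] / (1.0 + lam * M)
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def fobos_step(w, grad, groups, eta, lam):
        # One FOBOS iteration: unregularized gradient step, then the
        # group-wise elitist prox with effective strength eta * lam.
        v = w - eta * grad
        w_new = v.copy()                          # features outside any group are untouched
        for g in groups:                          # groups: list of index arrays
            w_new[g] = elitist_prox(v[g], eta * lam)
        return w_new

Within each group the threshold tau grows with the group's total mass, so features compete: the few strongest survive while the rest are zeroed, which is the within-group sparsity described in the abstract.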

[1] H. Zou et al., Regularization and variable selection via the elastic net, 2005.

[2] T. Heskes et al., Selecting Weighting Factors in Logarithmic Opinion Pools, 1997, NIPS.

[3] L. Burget et al., Empirical Evaluation and Combination of Advanced Language Modeling Techniques, 2011, INTERSPEECH.

[4] T. Zhang et al., A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, 2005, J. Mach. Learn. Res.

[5] L. Breiman et al., Bagging Predictors, 1996, Machine Learning.

[6] M. Kenward et al., An Introduction to the Bootstrap, 2007.

[7] N. A. Smith et al., Graph-Based Lexicon Expansion with Sparsity-Inducing Penalties, 2012, NAACL.

[8] C. D. Manning et al., Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling, 2005, ACL.

[9] F. Yvon et al., Practical Very Large Scale CRFs, 2010, ACL.

[10] A. McCallum et al., Reducing Weight Undertraining in Structured Discriminative Learning, 2006, NAACL.

[11] W. Daelemans et al., Applying System Combination to Base Noun Phrase Identification, 2000, COLING.

[12] M. Welling et al., Products of Experts, 2007.

[13] T. Cohn et al., Logarithmic Opinion Pools for Conditional Random Fields, 2005, ACL.

[14] Y. Bengio et al., Word Representations: A Simple and General Method for Semi-Supervised Learning, 2010, ACL.

[15] B. Torrésani et al., Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients, 2009, Signal Image Video Process.

[16] Y. Singer et al., Efficient Online and Batch Learning Using Forward Backward Splitting, 2009, J. Mach. Learn. Res.

[17] C. D. Manning et al., Fast dropout training, 2013, ICML.

[18] N. Srivastava et al., Improving neural networks by preventing co-adaptation of feature detectors, 2012, arXiv.

[19] J. DeNero et al., Model Combination for Machine Translation, 2010, HLT-NAACL.

[20] N. A. Smith et al., Structured Sparsity in Structured Prediction, 2011, EMNLP.

[21] E. F. Tjong Kim Sang et al., Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, 2003, CoNLL.

[22] M. I. Jordan, Leo Breiman, 2011, arXiv:1101.0929.