Reducing Weight Undertraining in Structured Discriminative Learning

Discriminative probabilistic models are popular in NLP because of the latitude they afford in designing features. But training involves complex trade-offs among weights, which can be dangerous: a few highly indicative features can swamp the contribution of many individually weaker features, causing the weights of the weaker features to be undertrained. Such a model is less robust, because the highly indicative features may be noisy or missing in the test data. To ameliorate this weight undertraining, we introduce several new feature bagging methods, in which separate models are trained on subsets of the original features and then combined using a mixture model or a product of experts. These methods include the logarithmic opinion pools used by Smith et al. (2005). We evaluate feature bagging on linear-chain conditional random fields (CRFs) for two natural-language tasks. On both tasks, the feature-bagged CRF performs better than a single CRF trained on all the features.
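The two combination rules named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it combines per-label distributions rather than full CRF sequence distributions, the two expert distributions and their names (p_lexical, p_context) are invented for illustration, and uniform weights are assumed. With weights that sum to one, the normalized geometric mean is a logarithmic opinion pool; an unweighted product of experts would correspond to setting every weight to 1.

```python
import math

def mixture(dists, weights=None):
    """Mixture model: weighted arithmetic mean of the expert distributions."""
    n = len(dists)
    weights = weights or [1.0 / n] * n  # assumption: uniform weights by default
    labels = dists[0].keys()
    return {y: sum(w * d[y] for w, d in zip(weights, dists)) for y in labels}

def log_opinion_pool(dists, weights=None):
    """Normalized weighted geometric mean of the expert distributions.

    Assumes every probability is strictly positive, so math.log is defined.
    """
    n = len(dists)
    weights = weights or [1.0 / n] * n  # assumption: uniform weights by default
    labels = dists[0].keys()
    scores = {
        y: math.exp(sum(w * math.log(d[y]) for w, d in zip(weights, dists)))
        for y in labels
    }
    z = sum(scores.values())  # renormalize so the result sums to one
    return {y: s / z for y, s in scores.items()}

# Hypothetical experts, each imagined as trained on a different feature subset.
p_lexical = {"PER": 0.9, "O": 0.1}
p_context = {"PER": 0.4, "O": 0.6}

mix = mixture([p_lexical, p_context])
pool = log_opinion_pool([p_lexical, p_context])
```

In this toy case the mixture averages the two opinions, while the pool multiplies them, so a label that any one expert considers unlikely is penalized more heavily.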

[1] D. G. Brennan. Linear Diversity Combining Techniques, 1959, Proceedings of the IRE.

[2] Eric Brill, et al. Some Advances in Transformation-Based Part of Speech Tagging, 1994, AAAI.

[3] Dean A. Pomerleau, et al. Neural Network Vision for Robot Driving, 1997.

[4] Stephen D. Bay. Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets, 1998, ICML.

[5] Khaled Ben Letaief, et al. Multiuser OFDM with adaptive subcarrier, bit, and power allocation, 1999, IEEE J. Sel. Areas Commun.

[6] J. Langford, et al. FeatureBoost: A Meta-Learning Algorithm that Improves Model Robustness, 2000, ICML.

[7] Andrew McCallum, et al. Maximum Entropy Markov Models for Information Extraction and Segmentation, 2000, ICML.

[8] Andrea J. Goldsmith, et al. Capacity and optimal resource allocation for fading broadcast channels - Part I: Ergodic capacity, 2001, IEEE Trans. Inf. Theory.

[9] Yuji Matsumoto, et al. Chunking with Support Vector Machines, 2001, NAACL.

[10] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[11] Thomas Hofmann, et al. Discriminative Learning for Label Sequences via Boosting, 2002, NIPS.

[12] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence, 2002, Neural Computation.

[13] Hwee Tou Ng, et al. Named Entity Recognition with a Maximum Entropy Approach, 2003, CoNLL.

[14] Francis K. H. Quek, et al. Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets, 2003, Pattern Recognit.

[15] Fernando Pereira, et al. Shallow Parsing with Conditional Random Fields, 2003, NAACL.

[16] Tong Zhang, et al. Named Entity Recognition through Classifier Combination, 2003, CoNLL.

[17] Dan Klein, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.

[18] Pat Langley, et al. Editorial: On Machine Learning, 1986, Machine Learning.

[19] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[20] Ben Taskar, et al. Max-Margin Parsing, 2004, EMNLP.

[21] Risto Wichman, et al. Performance of multiuser diversity in the presence of feedback errors, 2004, 2004 IEEE 15th International Symposium on Personal, Indoor and Mobile Radio Communications (IEEE Cat. No.04TH8754).

[22] Gregory W. Wornell, et al. Cooperative diversity in wireless networks: Efficient protocols and outage behavior, 2004, IEEE Transactions on Information Theory.

[23] J. Brouet, et al. Compression of associated signaling for adaptive multi-carrier systems, 2004, 2004 IEEE 59th Vehicular Technology Conference. VTC 2004-Spring (IEEE Cat. No.04CH37514).

[24] David Tse, et al. Fundamentals of Wireless Communication, 2005.

[25] Tong Zhang, et al. A High-Performance Semi-Supervised Learning Method for Text Chunking, 2005, ACL.

[26] Trevor Cohn, et al. Logarithmic Opinion Pools for Conditional Random Fields, 2005, ACL.

[27] Jeffrey G. Andrews, et al. Broadband wireless access with WiMax/802.16: current performance benchmarks and future potential, 2005, IEEE Communications Magazine.

[28] Andrew Smith, et al. Using Gazetteers in Discriminative Information Extraction, 2006, CoNLL.

[29] Adam Wolisz, et al. Performance analysis of dynamic OFDMA systems with inband signaling, 2006, IEEE Journal on Selected Areas in Communications.

[30] Laurence B. Milstein, et al. Analysis of multiuser diversity in time-varying channels, 2007, IEEE Transactions on Wireless Communications.

[31] Andrew McCallum, et al. An Introduction to Conditional Random Fields for Relational Learning, 2007.