Creating ensemble of diverse maximum entropy models

The diversity of a classifier ensemble has been shown to benefit overall classification performance. However, most conventional methods of training ensembles offer no control over the extent of diversity and operate as meta-learners. We present a method for creating an ensemble of diverse maximum entropy (∂MaxEnt) models, which are popular in speech and language processing. We modify the objective function used for conventional training of a MaxEnt model so that its output posterior distribution is diverse with respect to a reference model. Two diversity scores are explored: KL divergence and posterior cross-correlation. Experiments on the CoNLL-2003 Named Entity Recognition task and the IEMOCAP emotion recognition database show the benefits of a ∂MaxEnt ensemble.
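The following Python sketch illustrates the general idea under stated assumptions, not the authors' exact formulation: a MaxEnt (multinomial logistic regression) model is trained by minimizing the usual negative log-likelihood minus a weighted KL divergence between its posteriors and those of a fixed reference model, so the new member is rewarded for disagreeing with the reference. The toy data, the weight `lam`, and helper names such as `fit_dmaxent` are illustrative assumptions; the L-BFGS optimizer mirrors the one cited in [5].

```python
# Minimal sketch of a diversity-penalized MaxEnt objective (assumed form):
#   J(W) = NLL(W) - lam * mean KL( p_W(y|x) || p_ref(y|x) )
import numpy as np
from scipy.optimize import minimize

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def posteriors(W, X):
    # W: (num_classes, num_features), X: (num_samples, num_features)
    return softmax(X @ W.T)

def objective(w_flat, X, y, p_ref, lam, num_classes):
    W = w_flat.reshape(num_classes, X.shape[1])
    P = posteriors(W, X)
    nll = -np.mean(np.log(P[np.arange(len(y)), y] + 1e-12))
    # Average KL(P || P_ref); subtracting it encourages divergence from the reference.
    kl = np.mean(np.sum(P * (np.log(P + 1e-12) - np.log(p_ref + 1e-12)), axis=1))
    return nll - lam * kl

def fit_dmaxent(X, y, p_ref, lam, num_classes):
    w0 = np.zeros(num_classes * X.shape[1])
    res = minimize(objective, w0, args=(X, y, p_ref, lam, num_classes),
                   method="L-BFGS-B")  # gradients approximated numerically here
    return res.x.reshape(num_classes, X.shape[1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    # Reference member: a plain MaxEnt model (diversity weight set to zero).
    W_ref = fit_dmaxent(X, y, np.full((200, 2), 0.5), 0.0, 2)
    p_ref = posteriors(W_ref, X)
    # Diverse member: same data, penalized toward disagreeing with the reference.
    W_div = fit_dmaxent(X, y, p_ref, 0.5, 2)
    P_div = posteriors(W_div, X)
    kl = np.mean(np.sum(P_div * (np.log(P_div + 1e-12) - np.log(p_ref + 1e-12)), axis=1))
    print("mean KL from reference:", kl)
```

The posterior cross-correlation score mentioned in the abstract would slot into the same place as the KL term: replace `kl` with a correlation between the two models' posterior vectors and penalize (rather than reward) it.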

[1] Erik F. Tjong Kim Sang et al., Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, 2003, CoNLL.

[2] Ludmila I. Kuncheva et al., Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy, 2003, Machine Learning.

[3] Kagan Tumer et al., Analysis of decision boundaries in linearly combined neural classifiers, 1996, Pattern Recognition.

[4] Carlos Busso et al., IEMOCAP: interactive emotional dyadic motion capture database, 2008, Language Resources and Evaluation.

[5] Jorge Nocedal et al., On the limited memory BFGS method for large scale optimization, 1989, Mathematical Programming.

[6] Brian Kingsbury et al., The IBM 2008 GALE Arabic speech transcription system, 2010, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Ludmila I. Kuncheva et al., Combining Pattern Classifiers: Methods and Algorithms, 2004.

[8] Raymond J. Mooney et al., Constructing Diverse Classifier Ensembles using Artificial Training Examples, 2003, IJCAI.

[9] Subhash C. Bagui et al., Combining Pattern Classifiers: Methods and Algorithms, 2005, Technometrics.

[10] Björn Schuller et al., openSMILE: the Munich versatile and fast open-source audio feature extractor, 2010, ACM Multimedia.

[11] Tara N. Sainath et al., Application specific loss minimization using gradient boosting, 2011, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Thomas G. Dietterich, Multiple Classifier Systems, 2000, Lecture Notes in Computer Science.

[13] Christopher D. Manning et al., Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling, 2005, ACL.

[14] Naonori Ueda et al., Generalization error of ensemble estimators, 1996, Proceedings of the International Conference on Neural Networks (ICNN'96).

[15] Xin Yao et al., Ensemble learning via negative correlation, 1999, Neural Networks.

[16] Yoav Freund et al., A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[17] James Bennett et al., The Netflix Prize, 2007.

[18] Jianying Hu et al., Winning the KDD Cup Orange Challenge with Ensemble Selection, 2009, KDD Cup.

[19] Leo Breiman et al., Bagging Predictors, 1996, Machine Learning.