D-Confidence: an active learning strategy to reduce label disclosure complexity in the presence of imbalanced class distributions

In some classification tasks, such as the automatic building and maintenance of text corpora, labeled instances for training a classifier are expensive to obtain. In such circumstances it is common to have massive corpora in which only a few instances are labeled (typically a minority) while the rest are not. Semi-supervised learning techniques try to leverage the intrinsic information in unlabeled instances to improve classification models. However, these techniques assume that the labeled instances cover all the classes to be learned, which might not be the case. Moreover, in the presence of an imbalanced class distribution, obtaining labeled instances from minority classes can be very costly, requiring extensive labeling, if queries are selected at random. Active learning allows the learner to ask an oracle to label new instances, selected according to given criteria, with the aim of reducing the labeling effort. D-Confidence is an active learning approach that is effective in the presence of imbalanced training sets. In this paper we evaluate the performance of d-Confidence against its baseline criteria over tabular and text datasets. We provide empirical evidence that d-Confidence reduces label disclosure complexity, which we define as the number of queries required to identify instances from all the classes to be learned, in the presence of imbalanced data.
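The notion of label disclosure complexity can be made concrete with a toy sketch: count the oracle queries an active learning loop issues before every class has been disclosed. The sketch below is illustrative only; the selection criteria shown (random sampling as a baseline, and a farthest-first distance heuristic standing in for the distance component of d-Confidence) are simplified assumptions, not the paper's actual d-Confidence criterion, which also weighs classifier confidence.

```python
import random

def disclosure_complexity(pool, select, rng=None):
    """Query instances from `pool` (a list of (features, label) pairs)
    until every class has been disclosed; return the number of queries."""
    rng = rng or random.Random(0)
    all_classes = {label for _, label in pool}
    unlabeled, labeled = list(pool), []
    seen, queries = set(), 0
    while seen != all_classes:
        idx = select(unlabeled, labeled, rng)  # active learning criterion
        x, y = unlabeled.pop(idx)              # ask the oracle for the label y
        labeled.append((x, y))
        seen.add(y)
        queries += 1
    return queries

def random_selection(unlabeled, labeled, rng):
    # Baseline: pick the next query uniformly at random.
    return rng.randrange(len(unlabeled))

def farthest_first(unlabeled, labeled, rng):
    # Distance heuristic in the spirit of d-Confidence's distance component:
    # pick the unlabeled instance farthest from all labeled instances.
    if not labeled:
        return rng.randrange(len(unlabeled))
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return max(range(len(unlabeled)),
               key=lambda i: min(sqdist(unlabeled[i][0], x) for x, _ in labeled))

# Imbalanced pool: 95 instances of the majority class 0, clustered near the
# origin, and 5 instances of the minority class 1, clustered far away.
pool = [((float(i), 0.0), 0) for i in range(95)] + \
       [((1000.0 + i, 0.0), 1) for i in range(5)]
```

On this pool, the distance-aware criterion discloses both classes in two queries regardless of the first pick, whereas random selection needs, in expectation, many more queries to hit the 5% minority class.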
