Active Learning for Hierarchical Text Classification

Hierarchical text classification plays an important role in many real-world applications, such as webpage topic classification, product categorization and user feedback classification. Building an accurate hierarchical classification system usually requires a large number of labeled training examples. Active learning has been shown to reduce the number of training examples needed significantly, but it has not been applied to hierarchical text classification because of several technical challenges. In this paper, we study active learning for hierarchical text classification. We propose a realistic multi-oracle setting and a novel active learning framework, and devise several novel leveraging strategies under this framework. The hierarchical relations between categories are explored and leveraged to improve active learning further. Experiments show that our methods reduce the number of oracle queries by 74% to 90% when building accurate hierarchical classification systems. As far as we know, this is the first work to study active learning for hierarchical text classification, and its results are promising.
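To make the idea of saving oracle queries concrete, the following is a minimal sketch of generic pool-based active learning with uncertainty sampling for a single category's binary classifier. It is not the framework proposed in the paper: the multi-oracle setting and the leveraging strategies are not reproduced, and everything here is an assumption made for illustration, including the one-dimensional document scores, the threshold classifier, the hypothetical `oracle` function, and the names `fit_threshold` and `active_learn`.

```python
def fit_threshold(labeled):
    """Fit a 1-D threshold classifier: the midpoint between the largest
    labeled negative and the smallest labeled positive score."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    return (max(neg) + min(pos)) / 2.0


def active_learn(pool, oracle, seed_examples, budget):
    """Pool-based active learning with uncertainty sampling.

    At each step, query the oracle for the unlabeled example closest to
    the current decision boundary, then refit the classifier.
    """
    labeled = list(seed_examples)
    seen = {x for x, _ in labeled}
    unlabeled = [x for x in pool if x not in seen]
    queries = 0
    while unlabeled and queries < budget:
        threshold = fit_threshold(labeled)
        # The most uncertain example is the one nearest the boundary.
        x = min(unlabeled, key=lambda v: abs(v - threshold))
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))
        queries += 1
    return fit_threshold(labeled), queries


# Hypothetical setup: 200 documents reduced to a single score in [0, 1).
pool = [i / 200 for i in range(200)]


def oracle(x):
    """Stand-in for a human annotator for one category (assumed boundary)."""
    return int(x > 0.603)


# Two seed labels, one negative and one positive, to start the loop.
seeds = [(pool[0], oracle(pool[0])), (pool[-1], oracle(pool[-1]))]
threshold, queries = active_learn(pool, oracle, seeds, budget=15)
print(f"threshold={threshold:.4f} after {queries} oracle queries")
```

In this toy setting the loop behaves like a binary search over the pool, so a small query budget is enough to separate all 200 examples, whereas labeling the examples in arbitrary order could require far more labels. Real text classifiers and hierarchies are much less tidy, so the sketch only illustrates why querying near the decision boundary can save labels.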
