A utility-theoretic ranking method for semi-automated text classification

In Semi-Automated Text Classification (SATC) an automatic classifier F labels a set of unlabelled documents D, following which a human annotator inspects (and corrects when appropriate) the labels attributed by F to a subset D' of D, with the aim of improving the overall quality of the labelling. An automated system can support this process by ranking the automatically labelled documents in a way that maximizes the expected increase in effectiveness that derives from inspecting D'. An obvious strategy is to rank D so that the documents that F has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop a new utility-theoretic ranking method based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive from inspecting and correcting a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a ranked list generated by a given ranking method. We report the results of experiments showing that, with respect to the confidence-based baseline described above, and according to the proposed measure, our ranking method can achieve substantially higher expected reductions in classification error.
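To make the contrast between the two ranking policies concrete, the following is a minimal sketch, not the paper's actual formulation. It assumes binary single-label classification and a classifier that outputs posterior probabilities Pr(c|d); the function names and the illustrative per-error costs (fp_cost, fn_cost) are my own stand-ins for the measure-specific gains the utility-theoretic method would use.

```python
# Minimal sketch of (a) confidence-based ranking and (b) a utility-style ranking
# by expected inspection gain, i.e. the expected reduction in misclassification
# cost obtained if a human inspects (and, when wrong, corrects) a label.
# Assumptions: binary classification, probabilistic outputs, 0.5 decision threshold.

import numpy as np

def rank_by_low_confidence(posteriors: np.ndarray) -> np.ndarray:
    """Baseline: top-rank the documents the classifier is least confident about.

    Confidence is taken as the distance of Pr(c|d) from the 0.5 threshold;
    documents closest to the threshold come first.
    """
    confidence = np.abs(posteriors - 0.5)
    return np.argsort(confidence)  # ascending: least confident first

def rank_by_expected_inspection_gain(posteriors: np.ndarray,
                                     fp_cost: float = 1.0,
                                     fn_cost: float = 1.0) -> np.ndarray:
    """Utility-style sketch: top-rank documents whose inspection is expected
    to remove the most misclassification cost.

    If F assigned the positive label (Pr >= 0.5), that label is wrong with
    probability 1 - Pr, and correcting it removes a false positive (fp_cost);
    if F assigned the negative label, it is wrong with probability Pr, and
    correcting it removes a false negative (fn_cost).
    """
    assigned_positive = posteriors >= 0.5
    expected_gain = np.where(assigned_positive,
                             (1.0 - posteriors) * fp_cost,
                             posteriors * fn_cost)
    return np.argsort(-expected_gain)  # descending: largest expected gain first

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    post = rng.uniform(size=10)  # mock posteriors for 10 documents
    print(rank_by_low_confidence(post))
    print(rank_by_expected_inspection_gain(post, fp_cost=1.0, fn_cost=3.0))
```

In this toy setting, symmetric costs make the two rankings coincide; they diverge as soon as correcting different kinds of errors contributes differently to the chosen effectiveness measure, which is the scenario the inspection-gain method targets.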
