Utility-Theoretic Ranking for Semiautomated Text Classification

Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are ranked at the top. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
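The contrast between the confidence-based baseline and a utility-theoretic ranking can be illustrated with a minimal sketch. It is not the method of the paper, whose exact utility functions are not given in this abstract. It assumes a single binary class, F1 as the effectiveness measure, calibrated posterior probabilities, and estimated contingency-table counts; the names `Doc`, `validation_gains`, `rank_by_confidence`, and `rank_by_utility` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    predicted_positive: bool  # label assigned by the classifier
    p_positive: float         # assumed calibrated P(positive | doc)


def f1(tp: float, fp: float, fn: float) -> float:
    """F1 computed from contingency-table counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def validation_gains(tp: float, fp: float, fn: float) -> tuple[float, float]:
    """Increase in F1 from correcting one false positive or one false negative.

    Assumes fp >= 1 and fn >= 1 (estimated counts for the set being ranked).
    """
    base = f1(tp, fp, fn)
    gain_fp = f1(tp, fp - 1, fn) - base      # FP becomes a true negative
    gain_fn = f1(tp + 1, fp, fn - 1) - base  # FN becomes a true positive
    return gain_fp, gain_fn


def error_probability(doc: Doc) -> float:
    """Probability that the assigned label is wrong."""
    return 1.0 - doc.p_positive if doc.predicted_positive else doc.p_positive


def rank_by_confidence(docs: list[Doc]) -> list[Doc]:
    """Baseline: least confident (most likely wrong) documents first."""
    return sorted(docs, key=error_probability, reverse=True)


def rank_by_utility(docs: list[Doc], gain_fp: float, gain_fn: float) -> list[Doc]:
    """Rank by expected validation gain: P(error) times the gain of the fix."""
    def utility(doc: Doc) -> float:
        gain = gain_fp if doc.predicted_positive else gain_fn
        return error_probability(doc) * gain
    return sorted(docs, key=utility, reverse=True)


if __name__ == "__main__":
    docs = [
        Doc("a", predicted_positive=True, p_positive=0.55),
        Doc("b", predicted_positive=False, p_positive=0.40),
    ]
    gain_fp, gain_fn = validation_gains(tp=10, fp=5, fn=5)
    print([d.doc_id for d in rank_by_confidence(docs)])
    print([d.doc_id for d in rank_by_utility(docs, gain_fp, gain_fn)])
```

In this toy setting, document `a` is the more likely of the two to be mislabelled, so the baseline ranks it first. Correcting a false negative raises F1 more than correcting a false positive under the assumed counts, so the utility-based ranking places `b` first. This is the kind of case in which ranking purely by confidence can be suboptimal.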
