Paper information - Utility-Theoretic Ranking for Semiautomated Text Classification - 字舞流文

Utility-Theoretic Ranking for Semiautomated Text Classification

Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
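The contrast drawn in the abstract, between ranking by lowest confidence and ranking by expected validation gain, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a binary classifier that outputs a calibrated probability `p` of the positive class and labels a document positive when `p >= 0.5`. The function names and the two fixed gain values `gain_fp` and `gain_fn` (the benefit of correcting a false positive or a false negative) are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def rank_by_low_confidence(probs: Sequence[float]) -> List[int]:
    """Baseline sketch: document indices, least-confident labels first.

    Confidence is taken as the distance of p from the 0.5 decision threshold.
    """
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))


def rank_by_expected_gain(
    probs: Sequence[float], gain_fp: float, gain_fn: float
) -> List[int]:
    """Utility-style sketch: document indices, highest expected gain first.

    Expected gain = P(label is wrong) * gain from correcting that error type.
    gain_fp and gain_fn are hypothetical constants, not values from the paper.
    """

    def expected_gain(p: float) -> float:
        if p >= 0.5:
            # Labelled positive; the label is wrong (a false positive)
            # with probability 1 - p.
            return (1.0 - p) * gain_fp
        # Labelled negative; the label is wrong (a false negative)
        # with probability p.
        return p * gain_fn

    return sorted(
        range(len(probs)), key=lambda i: expected_gain(probs[i]), reverse=True
    )


# Hypothetical posterior probabilities for four documents.
probs = [0.9, 0.55, 0.3, 0.05]
baseline_order = rank_by_low_confidence(probs)
utility_order = rank_by_expected_gain(probs, gain_fp=1.0, gain_fn=3.0)
```

When correcting a false negative is assumed to be worth more than correcting a false positive, the two rankings diverge: the document with `p = 0.3` moves ahead of the one with `p = 0.55`, even though the classifier is less confident about the latter.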

Giacomo Berardi | Andrea Esuli | Fabrizio Sebastiani

[1] David D. Lewis,et al. Heterogeneous Uncertainty Sampling for Supervised Learning , 1994, ICML.

[2] Hema Raghavan,et al. Active Learning with Feedback on Features and Instances , 2006, J. Mach. Learn. Res..

[3] James P. Callan,et al. Training algorithms for linear text classifiers , 1996, SIGIR '96.

[4] John Platt,et al. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[5] E. Rowland. Theory of Games and Economic Behavior , 1946, Nature.

[6] Alistair Moffat,et al. Rank-biased precision for measurement of retrieval effectiveness , 2008, TOIS.

[7] IJsbrand Jan Aalbersberg,et al. Incremental relevance feedback , 1992, SIGIR '92.

[8] David D. Lewis,et al. Information retrieval for e-discovery , 2010, SIGIR.

[9] Fabrizio Sebastiani,et al. Utility-Theoretic Ranking for Semiautomated Text Classification , 2015 .

[10] W. Bruce Croft,et al. Combining classifiers in text categorization , 1996, SIGIR '96.

[11] Giacomo Berardi,et al. A utility-theoretic ranking method for semi-automated text classification , 2012, SIGIR '12.

[12] Carla E. Brodley,et al. Identifying Mislabeled Training Data , 1999, J. Artif. Intell. Res..

[13] Andrea Esuli,et al. Improving Text Classification Accuracy by Training Label Cleaning , 2013, TOIS.

[14] Xiaojin Zhu,et al. Introduction to Semi-Supervised Learning , 2009, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[15] Kristen Grauman,et al. What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations , 2009, CVPR.

[16] Jeffrey S. Simonoff,et al. A Penalty Function Approach to Smoothing Large Sparse Contingency Tables , 1983 .

[17] Andrea Esuli,et al. Active Learning Strategies for Multi-Label Text Classification , 2009, ECIR.

[18] Thomas Roelleke,et al. Document Difficulty Framework for Semi-automatic Text Classification , 2013, DaWaK.

[19] Daphne Koller,et al. Support Vector Machine Active Learning with Applications to Text Classification , 2000, J. Mach. Learn. Res..

[20] Andrea Esuli,et al. MP-Boost: A Multiple-Pivot Boosting Algorithm and Its Application to Text Categorization , 2006, SPIRE.

[21] Douglas W. Oard,et al. Evaluation of information retrieval for E-discovery , 2010, Artificial Intelligence and Law.

[22] ChengXiang Zhai,et al. A study of smoothing methods for language models applied to information retrieval , 2004, TOIS.

[23] Andrea Esuli,et al. Training Data Cleaning for Text Classification , 2009, ICTIR.

[24] Kenneth Ward Church,et al. What's Wrong with Adding One? , 1994 .

[25] Alexander Zien,et al. Semi-Supervised Learning , 2006 .

[26] P. Anand,et al. Foundations of Rational Choice Under Risk. , 1993 .

[27] Kamal Nigam,et al. Employing EM in Pool-based Active Learning for Text Classification , 1998 .

[28] David D. Lewis,et al. Text categorization of low quality images , 1995 .

[29] Rong Jin,et al. Large-scale text categorization by batch mode active learning , 2006, WWW '06.

[30] Chris Buckley,et al. OHSUMED: an interactive retrieval evaluation and new large test collection for research , 1994, SIGIR '94.

[31] Thorsten Joachims,et al. Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[32] Eric Horvitz,et al. Selective Supervision: Guiding Supervised Learning with Decision-Theoretic Active Learning , 2007, IJCAI.

[33] Andrew McCallum,et al. Employing EM and Pool-Based Active Learning for Text Classification , 1998, ICML.

[34] Rich Caruana,et al. Obtaining Calibrated Probabilities from Boosting , 2005, UAI.

[35] J. Neumann,et al. Theory of games and economic behavior , 1945, 100 Years of Math Milestones.

[36] Giacomo Berardi,et al. Optimising human inspection work in automated verbatim coding , 2014 .

[37] Fabrizio Sebastiani,et al. Machine learning in automated text categorization , 2001, CSUR.

[38] Charles Elkan,et al. The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[39] Yiming Yang,et al. A re-examination of text categorization methods , 1999, SIGIR '99.

[40] Yoshimi Suzuki,et al. Correcting Category Errors in Text Classification , 2004, COLING.

[41] Abhay Harpale,et al. Document Classification Through Interactive Supervision of Document and Term Labels , 2004, PKDD.

[42] Jiawei Han,et al. ACM Transactions on Knowledge Discovery from Data: Introduction , 2007 .

[43] P. Janssen,et al. Smoothing sparse contingency tables , 1998 .

[44] Yoram Singer,et al. BoosTexter: A Boosting-based System for Text Categorization , 2000, Machine Learning.

[45] Thorsten Joachims,et al. Dynamic ranked retrieval , 2011, WSDM '11.

[46] David Cohn,et al. Active Learning , 2010, Encyclopedia of Machine Learning.

[47] Stanley F. Chen,et al. An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[48] Sirvan Yahyaei,et al. Semi-automatic Document Classification: Exploiting Document Difficulty , 2012, ECIR.

[49] Stephen E. Robertson,et al. A new interpretation of average precision , 2008, SIGIR '08.