NEWS CLASSIFICATION WITH HUMAN ANNOTATORS: A CASE STUDY

The classification of textual documents has become an increasingly active research field with the growth of online news. While most news on news websites is categorised manually, the task becomes more strenuous given the tremendous volume of data updates every day. This paper addresses the question of how text classification algorithms can take over this task from manual classification. A combined method using Bracewell's algorithm and the top-n method is demonstrated and tested on an Indonesian-language corpus. The experiment also uses human evaluation as the benchmark. The results of the human evaluation are further investigated in order to understand how the annotators classify documents and which aspects can be improved to enhance the method in the future. The results indicate that the method can outperform human annotators by 13% in terms of accuracy.
