Integrating Collocation as TF-IDF Enhancement to Improve Classification Accuracy

This study is motivated by a weakness of Term Frequency – Inverse Document Frequency (TF-IDF): it weights single terms, and single terms can be vague. A single term used for indexing can convey several interpretations. A single term can also be too general to have any discriminating power. For example, the two individual terms "junior" and "college" are not enough to distinguish "junior college" from "college junior". This study therefore aims to enhance TF-IDF by integrating collocations as term features. The collocated terms are extracted based on part-of-speech (POS) tags that form specific patterns such as adjective + noun, noun + noun, and noun + verb. Three document classifiers were considered in this study and subjected to both the traditional and the modified TF-IDF: RandomForest, MultinomialNB (MultiNB), and SVM. The results of the experiment show that integrating collocation into the TF-IDF process outperforms the traditional TF-IDF, with an increase in classification accuracy of up to 10 percent.
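
The idea can be illustrated with a minimal sketch, which is not the study's actual implementation. It assumes tokens that are already POS-tagged (the tags below are hand-assigned for illustration), joins adjacent pairs matching the three named patterns into an extra feature, and uses one common TF-IDF variant (length-normalized term frequency times log(N/df)). The function names, the tag labels, the underscore-joined feature format, and the choice to keep the single terms alongside the collocations are all assumptions.

```python
import math
from collections import Counter

# POS patterns named in the abstract; the tag labels themselves are an assumption.
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN"), ("NOUN", "VERB")}

def add_collocations(tagged_tokens):
    """Return the single terms plus one joined feature per adjacent
    pair whose POS tags match a pattern (hypothetical helper)."""
    features = [word for word, _ in tagged_tokens]
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in PATTERNS:
            features.append(w1 + "_" + w2)
    return features

def tf_idf(docs):
    """Compute TF-IDF per document, assuming tf = count / doc length
    and idf = log(N / df); the study may use a different variant."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hand-tagged toy documents; a real pipeline would use a trained POS tagger.
tagged_docs = [
    [("she", "PRON"), ("attends", "VERB"), ("a", "DET"),
     ("junior", "ADJ"), ("college", "NOUN")],
    [("he", "PRON"), ("is", "VERB"), ("a", "DET"),
     ("college", "NOUN"), ("junior", "NOUN")],
]

feature_docs = [add_collocations(doc) for doc in tagged_docs]
weights = tf_idf(feature_docs)

for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this toy corpus "junior" and "college" occur in both documents, so their single-term weights are zero and cannot separate the two texts, while the collocation features "junior_college" and "college_junior" each occur in only one document and receive a positive weight.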
