Performance of Classifiers in Bangla Text Categorization

Automated text categorization, or text classification, has become an important text mining task, especially with the rapid growth in the number of online documents. An automatic text classification system aims to assign text documents to predefined categories based on linguistic characteristics. Although research has progressed significantly for languages such as English, Arabic, and Chinese, there has been comparatively little development for Indian languages, especially Bangla, which is one of the most widely spoken languages of India and Bangladesh. One reason is the inherent complexity of Bangla, compounded by the unavailability of standard datasets and resources. This paper presents the performance of different classifiers on the text classification task using the ‘term association’ and ‘term aggregation’ feature extraction methods; an accuracy of 98.68% was obtained on a dataset of 8000 Bangla text documents collected from various web sources.
