CESS: A System to Categorize Bangla Web Text Documents

Technology has evolved remarkably, leading to an exponential increase in the availability of digital text documents from disparate domains on the Internet. This makes information retrieval a time- and resource-consuming task. A system that categorizes such documents by domain can therefore help users obtain the required information with relative ease and also reduce the workload of search engines. This article presents a text categorization system (CESS) that categorizes text documents using newly proposed hybrid features that combine term frequency-inverse document frequency-inverse class frequency (TF-IDF-ICF) and modified chi-square methods. Experiments were performed on real-world Bangla documents from eight domains, comprising 24,29,857 tokens, and the highest accuracy of 99.91% was obtained with multilayer perceptron-based classification. To show the language-independent nature of the system, it was also tested on the Reuters-21578 and 20 Newsgroups datasets, obtaining accuracies of 97.29% and 94.67%, respectively.
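The abstract names the term frequency-inverse document frequency-inverse class frequency weighting but does not give its formula. The following is a minimal sketch of one common way to compute such a weight, offered only as an illustration. The function name `tf_idf_icf`, the length-normalized term frequency, and the smoothed logarithms `log(1 + N/df)` and `log(1 + C/cf)` are all assumptions and not necessarily the formulation used in CESS. The modified chi-square component and the multilayer perceptron classifier are not covered here.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   -- list of token lists
    labels -- list of class labels, parallel to docs

    Assumed weighting (not taken from the paper):
        weight = (count / doc_length) * log(1 + N / df) * log(1 + C / cf)
    where N is the number of documents, df the number of documents
    containing the term, C the number of classes, and cf the number of
    classes in which the term occurs.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Class frequency: the set of classes in which each term occurs.
    term_classes = {}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_classes.setdefault(term, set()).add(label)

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        doc_weights = {}
        for term, count in counts.items():
            idf = math.log(1 + n_docs / df[term])
            icf = math.log(1 + n_classes / len(term_classes[term]))
            doc_weights[term] = (count / total) * idf * icf
        weights.append(doc_weights)
    return weights
```

Under these assumptions, a term that appears in few documents and few classes receives a higher weight than a term spread across every class, which is the intuition behind adding the inverse class frequency factor to plain TF-IDF.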
