Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network

The rapid growth of electronic documents produces large volumes of unstructured data, making it time-consuming and laborious to find a relevant document. Text Document Classification (TDC), which organizes unstructured documents into pre-defined classes, is of great significance in information processing and retrieval. Among South Asian languages, Urdu is a popular research language because of its complex morphology, unique features, and lack of linguistic resources such as standard datasets. Compared with short-text tasks such as sentiment analysis, long-text classification needs more time and effort because of its larger vocabulary, greater noise, and redundant information. Machine Learning (ML) and Deep Learning (DL) models have been widely used in text processing. Despite major limitations of ML models, such as learning directed features, they remain the preferred methods for Urdu TDC. To the best of our knowledge, this is the first study of Urdu TDC using a DL model. In this paper, we design a large multi-purpose and multi-format dataset that contains more than ten thousand documents organized into six classes. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for classification and compare its performance with sixteen ML baseline models on three imbalanced datasets of various sizes. Further, we analyze the effects of preprocessing methods on SMFCNN performance. SMFCNN outperformed the baseline classifiers and achieved accuracy scores of 95.4%, 91.8%, and 93.3% on the medium, large, and small datasets, respectively. The designed dataset will be publicly and freely available in different formats for future research in Urdu text processing.
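
The core idea behind a single convolutional layer with filters of several sizes can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the filter sizes (3, 4, 5), the number of filters per size, the embedding dimension, the random weights, and the function names `conv_max_pool`, `make_filter_bank`, and `smf_features` are all assumptions chosen for illustration. Each filter spans a window of consecutive token embeddings, and ReLU followed by max-over-time pooling reduces a document of any length to one feature per filter.

```python
import random


def conv_max_pool(embedded, filt):
    """Slide one filter over the token sequence, apply ReLU, then max-pool over time."""
    h = len(filt)  # filter height: number of consecutive tokens covered
    best = 0.0     # starting at 0 makes the running max equal to max over ReLU outputs
    for start in range(len(embedded) - h + 1):
        window = embedded[start:start + h]
        score = sum(
            w * x
            for f_row, t_row in zip(filt, window)
            for w, x in zip(f_row, t_row)
        )
        best = max(best, score)
    return best


def make_filter_bank(sizes, filters_per_size, dim, seed=0):
    """Build randomly initialised filters; a trained model would learn these weights."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(h)]
        for h in sizes
        for _ in range(filters_per_size)
    ]


def smf_features(embedded, filter_bank):
    """Concatenate the pooled outputs of all filters in the single layer."""
    return [conv_max_pool(embedded, f) for f in filter_bank]


# Hypothetical toy document: 12 tokens, each a 4-dimensional embedding.
dim = 4
rng = random.Random(1)
doc = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(12)]

# Assumed filter sizes 3, 4, and 5 with two filters each (illustrative values only).
bank = make_filter_bank(sizes=(3, 4, 5), filters_per_size=2, dim=dim)
features = smf_features(doc, bank)
print(len(features))  # one pooled feature per filter
```

In a full classifier, a feature vector of this kind would typically feed a dense softmax layer over the six document classes; that layer and the training loop are omitted here.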
