Urdu Text Classification using Majority Voting

Text classification assigns predefined categories to text documents using supervised machine learning algorithms. It has many practical applications, such as spam detection, sentiment detection, and language identification. We applied five well-known classification techniques to an Urdu-language corpus and assigned a class to each document by majority voting over their predictions. The corpus contains 21,769 news documents in seven categories (including Business, Entertainment, Culture, Health, Sports, and Weird). Because the algorithms could not work directly on the raw text, we first applied preprocessing techniques: tokenization, stop-word removal, and a rule-based stemmer. After preprocessing, 93,400 features were extracted from the data as input to the machine learning algorithms. With majority voting we achieved up to 94% precision and recall.
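The majority-voting step can be illustrated with a minimal sketch. It assumes each of the five classifiers has already produced one label per document. The abstract does not name the classifiers or say how ties are resolved, so the function names, the sample labels, and the tie-breaking rule (first label to reach the top count wins) are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most classifiers for one document.

    Assumption: ties go to the tied label that appears first in `predictions`.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


def vote_per_document(per_classifier_labels):
    """Combine label lists (one list per classifier, aligned by document)."""
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_labels)]


# Hypothetical outputs of five classifiers on three documents.
outputs = [
    ["Sports", "Health", "Business"],
    ["Sports", "Health", "Weird"],
    ["Sports", "Culture", "Business"],
    ["Health", "Health", "Business"],
    ["Sports", "Sports", "Weird"],
]

final_labels = vote_per_document(outputs)
print(final_labels)  # ['Sports', 'Health', 'Business']
```

With an odd number of voters, a two-way tie for first place can only occur when the votes are spread over three or more labels, which is one reason ensembles of five classifiers are convenient for this scheme.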
