Event classification from the Urdu language text on social media

The real-time availability of the Internet has engaged millions of users around the world. Users increasingly prefer regional languages for effective and easy communication, which produces multilingual data on social networks and news channels. People share ideas, opinions, and events happening around the world (e.g., sports, inflation, protests, explosions, and sexual assault) in regional (local) languages on social media. Extracting and classifying events from multilingual data has become a bottleneck because resources for these languages are lacking. In this paper, we present an event classification task for Urdu-language text from social media and news channels using machine learning classifiers. The dataset contains more than 0.1 million (102,962) labeled instances of twelve (12) different types of events. The title, its length, and the last four words of a sentence are used as features to classify the events. As a feature vector, Term Frequency-Inverse Document Frequency (tf-idf) showed the best results when evaluating six popular machine learning classifiers. Random Forest (RF) and K-Nearest Neighbor (KNN) outperformed the other classifiers, achieving 98.00% and 99.00% accuracy, respectively. The novelty is that, to the best of our knowledge, these features have not previously been applied to event extraction from text written in Urdu.
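To make the feature and classifier description concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It extracts the three features named above (title tokens, title length, last four words), builds tf-idf vectors, and classifies with a cosine-similarity K-Nearest Neighbor vote. Several things here are assumptions made for illustration: the function names, the `last4:` prefix used to merge the last-four-words feature into the token list, the smoothed idf formula, the value `k=3`, and the tiny English toy titles, which stand in for Urdu text and are not drawn from the paper's dataset.

```python
import math
from collections import Counter


def extract_features(title):
    """Return the title tokens, the title length in tokens, and the last four words."""
    tokens = title.split()
    return {"tokens": tokens, "length": len(tokens), "last_four": tokens[-4:]}


def to_terms(title):
    """Assumed encoding: title tokens plus 'last4:'-prefixed copies of the last four words."""
    feats = extract_features(title)
    return feats["tokens"] + ["last4:" + w for w in feats["last_four"]]


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every term in the training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(terms, idf):
    """Build an L2-normalised sparse tf-idf vector; terms unseen in training are dropped."""
    tf = Counter(t for t in terms if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors, i.e. their cosine similarity."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy titles (placeholders for Urdu headlines), two per event type.
TRAIN = [
    ("team wins cricket match final", "sports"),
    ("player scores century in cricket match", "sports"),
    ("prices of flour rise again", "inflation"),
    ("fuel prices rise across country", "inflation"),
]

TRAIN_TERMS = [to_terms(title) for title, _ in TRAIN]
TRAIN_LABELS = [label for _, label in TRAIN]
IDF = fit_idf(TRAIN_TERMS)
TRAIN_VECS = [tfidf_vector(terms, IDF) for terms in TRAIN_TERMS]


def classify(title, k=3):
    """Classify a new title with the toy tf-idf + KNN model built above."""
    return knn_predict(tfidf_vector(to_terms(title), IDF), TRAIN_VECS, TRAIN_LABELS, k)


if __name__ == "__main__":
    print(classify("cricket team wins the match"))
    print(classify("flour prices rise"))
```

With a real corpus, the same steps would more likely be run through an established library, and Random Forest could be substituted for the KNN vote; the abstract does not specify how the three features were combined, so the merged-token encoding above is only one plausible choice.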
