Co-occurrences using Fasttext embeddings for word similarity tasks in Urdu

Urdu is a widely spoken language in South Asia. Although abundant literature exists in Urdu, the available data is still insufficient for processing the language with NLP techniques. Highly efficient language models exist for English, a high-resource language, but Urdu and other under-resourced languages have long been neglected. Building efficient language models for these languages requires good word embedding models. For Urdu, the only word embeddings we could find were trained and developed using the skip-gram model. In this paper, we build a corpus for Urdu by scraping and integrating data from various sources, and we compile a vocabulary for the Urdu language. We also modify fastText embeddings and n-gram models so that they can be trained on our corpus. We use the trained embeddings for a word similarity task and compare the results with existing techniques. The datasets and code are freely available on GitHub.
