RUBERT: A Bilingual Roman Urdu BERT Using Cross Lingual Transfer Learning

Recent studies have shown that multilingual language models underperform their monolingual counterparts (Conneau et al., 2020). At the same time, training and maintaining a monolingual model for each language is a costly and time-consuming process. Roman Urdu is a resource-starved language used widely on social media platforms and chat apps. In this research we propose a novel dataset of scraped tweets containing 54M tokens and 3M sentences. We also propose RUBERT, a bilingual Roman Urdu model created by additional pretraining of English BERT (Devlin et al., 2019). We compare its performance with a monolingual Roman Urdu BERT trained from scratch and with a multilingual Roman Urdu BERT created by additional pretraining of multilingual BERT (mBERT; Devlin et al., 2019). Our experiments show that additional pretraining of the English BERT produces the most notable performance improvement.
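The "additional pretraining" above continues BERT's masked-language-modelling (MLM) objective on the new Roman Urdu corpus. A minimal sketch of BERT's standard 80/10/10 masking scheme over token IDs (the token IDs below are illustrative; `MASK_ID` and `VOCAB_SIZE` follow the usual bert-base-uncased vocabulary, not values stated in this paper):

```python
import random

MASK_ID = 103       # [MASK] token id in the bert-base-uncased vocabulary
VOCAB_SIZE = 30522  # bert-base-uncased vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style MLM masking: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (masked inputs, labels with -100 at unselected positions)."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = ignored by the MLM loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok           # model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else 10%: keep the original token
    return inputs, labels
```

In the bilingual setting this objective is unchanged; what differs is the corpus (Roman Urdu tweets) and the initialization (English BERT weights rather than random initialization).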

[1] Laurent Romary, et al. CamemBERT: a Tasty French Language Model, 2019, ACL.

[2] Sebastian Ruder, et al. Universal Language Model Fine-tuning for Text Classification, 2018, ACL.

[3] Muhammad Tariq, et al. Accurate detection of sitting posture activities in a secure IoT based assisted living environment, 2018, Future Gener. Comput. Syst.

[4] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.

[5] Bilal Tahir, et al. Sentiment and Emotion Analysis of Text: A Survey on Approaches and Resources, 2020.

[6] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[8] Mirza Omer Beg. FLECS: A Data-Driven Framework for Rapid Protocol Prototyping, 2007.

[9] Peter van Beek, et al. A graph theoretic approach to cache-conscious placement of data for direct mapped caches, 2010, ISMM '10.

[10] Thar Baker, et al. AlphaLogger: detecting motion-based side-channel attack using smartphone keystrokes, 2020, Journal of Ambient Intelligence and Humanized Computing.

[11] Aaditeshwar Seth, et al. Achieving privacy and security in radio frequency identification, 2006, PST.

[12] Mirza Beg. FLECS: A Framework for Rapidly Implementing Forwarding Protocols, 2009, Complex.

[13] Tapio Salakoski, et al. Multilingual is not enough: BERT for Finnish, 2019, ArXiv.

[14] Naveed Akhtar, et al. Subspace Gaussian Mixture Model for Continuous Urdu Speech Recognition using Kaldi, 2020, 14th International Conference on Open Source Systems and Technologies (ICOSST).

[15] Mirza Omer Beg, et al. Towards energy aware object-oriented development of android applications, 2019, Sustain. Comput. Informatics Syst.

[16] Muhammad Asim, et al. DeepDetect: Detection of Distributed Denial of Service Attacks Using Deep Learning, 2019, Comput. J.

[17] Adeel Zafar, et al. Using patterns as objectives for general video game level generation, 2019, J. Int. Comput. Games Assoc.

[18] Hammad Majeed, et al. Fairness in Real-Time Energy Pricing for Smart Grid Using Unsupervised Learning, 2018, Comput. J.

[19] Yoshua Bengio, et al. A Neural Probabilistic Language Model, 2003, J. Mach. Learn. Res.

[20] Mubashar Nazar Awan, et al. TOP-Rank: A TopicalPostionRank for Extraction and Classification of Keyphrases in Text, 2021, Comput. Speech Lang.

[22] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.

[23] H. Mujtaba, et al. Emotion Detection in Roman Urdu Text using Machine Learning, 2020, 35th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW).

[24] Paul Rayson, et al. COUNTER: corpus of Urdu news text reuse, 2017, Lang. Resour. Evaluation.

[25] Adeel Zafar, et al. Search-based procedural content generation for GVG-LG, 2020, Appl. Soft Comput.

[26] Sajid Ali, et al. Deceptive Level Generator, 2018, AIIDE Workshops.

[27] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[28] Mirza Omer Beg, et al. A Methodology for Relating Software Structure with Energy Consumption, 2017, IEEE 17th International Working Conference on Source Code Analysis and Manipulation (SCAM).

[29] Yejin Choi, et al. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference, 2018, EMNLP.

[30] Mirza O. Beg, et al. Domain Specific Emotion Lexicon Expansion, 2018, 14th International Conference on Emerging Technologies (ICET).

[31] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[32] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[33] Sampo Pyysalo, et al. WikiBERT Models: Deep Transfer Learning for Many Languages, 2020, NODALIDA.

[34] Jason Weston, et al. A unified architecture for natural language processing: deep neural networks with multitask learning, 2008, ICML '08.

[35] M. Beg, et al. Pulmonary Crohn's Disease in Down Syndrome: A Link or Linkage Problem, 2016, Case Reports in Gastroenterology.

[36] Hammad Majeed, et al. Relationship Identification Between Conversational Agents Using Emotion Analysis, 2021, Cogn. Comput.

[37] Hamza M. Alvi, et al. EnSights: A tool for energy aware software development, 2017, 13th International Conference on Emerging Technologies (ICET).

[38] Saif ur Rehman Khan, et al. BigData Analysis of Stack Overflow for Energy Consumption of Android Framework, 2019, International Conference on Innovative Computing (ICIC).

[39] Adeel Zafar, et al. A Constructive Approach for General Video Game Level Generation, 2019, 11th Computer Science and Electronic Engineering (CEEC).

[40] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[41] Guillaume Lample, et al. Cross-lingual Language Model Pretraining, 2019, NeurIPS.

[42] Mirza Omer Beg, et al. WEEC: Web Energy Efficient Computing: A machine learning approach, 2019, Sustain. Comput. Informatics Syst.

[43] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.

[44] Thar Baker, et al. A collaborative healthcare framework for shared healthcare plan with ambient intelligence, 2020, Hum. centric Comput. Inf. Sci.

[45] Mirza Omer Beg, et al. A deep learning framework for clickbait detection on social area network using natural language cues, 2020, J. Comput. Soc. Sci.

[46] Tomas Mikolov, et al. Bag of Tricks for Efficient Text Classification, 2016, EACL.

[47] Mubashar Nazar Awan, et al. Algorithmic Machine Learning for Prediction of Stock Prices, 2019, FinTech as a Disruptive Technology for Financial Institutions.

[48] Waseem Shahzad, et al. Corpus for Emotion Detection on Roman Urdu, 2019, 22nd International Multitopic Conference (INMIC).

[49] David R. Cheriton, et al. Critical Path Heuristic for Automatic Parallelization, 2008.

[50] Helena Gómez-Adorno, et al. "Bend the truth": Benchmark dataset for fake news detection in Urdu language and its evaluation, 2020, J. Intell. Fuzzy Syst.

[51] A. Imdad, et al. Case 2: Recurrent Anemia in a 10-year-old Girl, 2015, Pediatrics in Review.

[52] M. Karsten, et al. An axiomatic basis for communication, 2007, SIGCOMM '07.

[53] Peter van Beek, et al. A constraint programming approach for integrated spatial and temporal scheduling for clustered architectures, 2013, TECS.

[54] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[55] Veselin Stoyanov, et al. Unsupervised Cross-lingual Representation Learning at Scale, 2019, ACL.

[56] David R. Cheriton, et al. MAXSM: A Multi-Heuristic Approach to XML Schema Matching, 2006.

[57] Khan Muhammad, et al. Understanding Citizen Issues through Reviews: A Step towards Data Informed Planning in Smart Cities, 2018, Applied Sciences.

[58] Percy Liang, et al. Know What You Don't Know: Unanswerable Questions for SQuAD, 2018, ACL.

[59] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.