Complementary Strategies for Low Resourced Morphological Modeling

Morphologically rich languages are challenging for natural language processing tasks due to data sparsity. This can be addressed either by introducing out-of-context morphological knowledge, or by developing machine learning architectures that specifically target data sparsity and/or morphological information. We find these approaches to complement each other in a morphological paradigm modeling task in Modern Standard Arabic, which, in addition to being morphologically complex, features ubiquitous ambiguity, exacerbating sparsity with noise. Given a small number of out-of-context rules describing closed class morphology, we combine them with word embeddings leveraging subword strings and noise reduction techniques. The combination outperforms both approaches individually by about 20% absolute. While morphological resources already exist for Modern Standard Arabic, our results inform how comparable resources might be constructed for non-standard dialects or any morphologically rich, low resourced language, given scarcity of time and funding.
