Morphology Matters: A Multilingual Language Modeling Analysis

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger set of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained on BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies such as Morfessor and finite-state transducers (FSTs) and find that these strategies yield better performance and reduce the impact of a language's morphology on language modeling.
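The abstract does not say which corpus-based measures were used. As an illustration only, the sketch below computes two measures that appear in the works listed in the references: the type-token ratio (TTR) and the moving-average type-token ratio (MATTR). The function names, the whitespace tokenization, and the window size of 500 are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=500):
    """Moving-average TTR: the mean TTR over every window of `window` tokens.

    The window size is an assumption for this sketch. Texts no longer than
    one window fall back to the plain TTR.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    if n <= window:
        return type_token_ratio(tokens)

    # Type counts for the first window.
    counts = Counter(tokens[:window])
    total_types = len(counts)

    # Slide the window one token at a time, updating the counts incrementally.
    for i in range(window, n):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]
        counts[tokens[i]] += 1
        total_types += len(counts)

    num_windows = n - window + 1
    return total_types / (window * num_windows)


if __name__ == "__main__":
    # Hypothetical toy input; a real corpus would be tokenized more carefully.
    text = "in the beginning was the word and the word was with god"
    tokens = text.split()
    print(type_token_ratio(tokens))
    print(mattr(tokens, window=5))
```

A higher value from either function indicates more distinct word forms per token, which is the intuition behind treating such measures as proxies for morphological richness. MATTR averages over fixed-size windows, so it is less sensitive to text length than plain TTR.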

[1] Richard Socher, et al. An Analysis of Neural Language Modeling at Multiple Scales, 2018, ArXiv.

[2] Kimmo Kettunen, et al. Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?, 2014, J. Quant. Linguistics.

[3] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[4] Çağrı Çöltekin, et al. A Freely Available Morphological Analyzer for Turkish, 2010, LREC.

[5] Christian Bentz, et al. A Comparison Between Morphological Complexity Measures: Typological Data vs. Language Corpora, 2016, CL4LC@COLING 2016.

[6] Ryan Cotterell, et al. Are All Languages Equally Hard to Language-Model?, 2018, NAACL.

[7] Pascal Denis, et al. A Framework for Understanding the Role of Morphology in Universal Dependency Parsing, 2018, EMNLP.

[8] Septina Dian Larasati, et al. Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus, 2011, SFCM.

[9] Mikko Kurimo, et al. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline, 2013.

[10] Jason Eisner, et al. Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model, 2018, AAAI.

[11] A set of open source tools for Turkish natural language processing, 2014, LREC.

[12] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[13] S. Arikawa, et al. Byte Pair Encoding: a Text Compression Scheme That Accelerates Pattern Matching, 1999.

[14] Tommi A. Pirinen, et al. Omorfi: Free and open source morphological lexical database for Finnish, 2015, NODALIDA.

[15] Ryan Cotterell, et al. UniMorph 3.0: Universal Morphology, 2018, LREC.

[16] Mark Steedman, et al. A massively parallel corpus: the Bible in 100 languages, 2014, Lang. Resour. Evaluation.

[17] Benoît Sagot, et al. Comparing Complexity Measures, 2013.

[18] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[19] Taku Kudo, et al. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, 2018, ACL.

[20] Judith L. Klavans. Computational Challenges for Polysynthetic Languages, 2018.

[21] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[22] Greg Durrett, et al. Byte Pair Encoding is Suboptimal for Language Model Pretraining, 2020, FINDINGS.

[23] Mathias Creutz, et al. Unsupervised models for morpheme segmentation and morphology learning, 2007, TSLP.

[24] Ryan Cotterell, et al. What Kind of Language Is Hard to Language-Model?, 2019, ACL.

[25] Ulrich Heid, et al. SMOR: A German Computational Morphology Covering Derivation, Composition and Inflection, 2004, LREC.

[26] Krister Lindén. Helsinki Finite-State Technology, 2014.

[27] Francis M. Tyers, et al. Dependency annotation of noun incorporation in polysynthetic languages, 2020, UDW.

[28] Michael A. Covington, et al. Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR), 2010, J. Quant. Linguistics.

[29] Flor Cagniy Cárdenas Mariño, et al. Analizador morfológico de la lengua quechua basado en software libre Helsinki Finite-State Transducer (HFST), 2013.

[30] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[31] Katharina Kann, et al. Lost in Translation: Analysis of Information Loss During Machine Translation Between Polysynthetic and Fusional Languages, 2018, ArXiv.

[32] Thomas Mayer, et al. Creating a massively parallel Bible corpus, 2014, LREC.

[33] Maciej Tomczak, et al. The need to report effect size estimates revisited. An overview of some recommended measures of effect size, 2014.

[34] Adam Lopez, et al. From Characters to Words to in Between: Do We Capture Morphology?, 2017, ACL.

[35] Anna Korhonen, et al. On the Relation between Linguistic Typology and (Limitations of) Multilingual Language Modeling, 2018, EMNLP.

[36] Philipp Koehn, et al. Europarl: A Parallel Corpus for Statistical Machine Translation, 2005, MTSUMMIT.

[37] Y. Benjamini, et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing, 1995.