Paper information - Morphology Matters: A Multilingual Language Modeling Analysis - 字舞流文

Morphology Matters: A Multilingual Language Modeling Analysis

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language's morphology on language modeling.
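As a rough illustration of the surprisal measure mentioned in the abstract, the sketch below sums the surprisal (negative log probability, in bits) that a language model assigns to each token of a text. The function names and the toy probabilities are hypothetical and not taken from the paper, and the per-unit normalization is an assumption about how texts segmented at different granularities might be compared, not the authors' stated procedure.

```python
import math


def total_surprisal_bits(token_probs):
    """Sum of -log2(p) over the probability assigned to each token."""
    return sum(-math.log2(p) for p in token_probs)


def surprisal_per_unit(token_probs, n_units):
    """Total surprisal divided by a chosen unit count (e.g. verses or characters).

    Dividing by a segmentation-independent unit is an assumption made here so
    that BPE, Morfessor, or FST segmentations of one text could be compared.
    """
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return total_surprisal_bits(token_probs) / n_units


# Toy example: hypothetical model probabilities for three subword tokens.
probs = [0.5, 0.25, 0.125]
total = total_surprisal_bits(probs)          # 1 + 2 + 3 bits
per_unit = surprisal_per_unit(probs, n_units=2)
```

Lower total surprisal on held-out text means the model found that text easier to predict, which is the sense in which a language is "easier to model" here.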

Han Liu | Lane Schwartz | Kenneth Steimel | Coleman Haley | Hyunji Hayley Park | Katherine J. Zhang

[1] Richard Socher, et al. An Analysis of Neural Language Modeling at Multiple Scales, 2018, ArXiv.

[2] Kimmo Kettunen, et al. Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?, 2014, J. Quant. Linguistics.

[3] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[4] Çağrı Çöltekin, et al. A Freely Available Morphological Analyzer for Turkish, 2010, LREC.

[5] Christian Bentz, et al. A Comparison Between Morphological Complexity Measures: Typological Data vs. Language Corpora, 2016, CL4LC@COLING 2016.

[6] Ryan Cotterell, et al. Are All Languages Equally Hard to Language-Model?, 2018, NAACL.

[7] Pascal Denis, et al. A Framework for Understanding the Role of Morphology in Universal Dependency Parsing, 2018, EMNLP.

[8] Septina Dian Larasati, et al. Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus, 2011, SFCM.

[9] Mikko Kurimo, et al. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline, 2013.

[10] Jason Eisner, et al. Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model, 2018, AAAI.

[11] A set of open source tools for Turkish natural language processing, 2014, LREC.

[12] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[13] S. Arikawa, et al. Byte Pair Encoding: a Text Compression Scheme That Accelerates Pattern Matching, 1999.

[14] Tommi A. Pirinen, et al. Omorfi — Free and open source morphological lexical database for Finnish, 2015, NODALIDA.

[15] Ryan Cotterell, et al. UniMorph 3.0: Universal Morphology, 2018, LREC.

[16] Mark Steedman, et al. A massively parallel corpus: the Bible in 100 languages, 2014, Lang. Resour. Evaluation.

[17] Benoît Sagot, et al. Comparing Complexity Measures, 2013.

[18] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[19] Taku Kudo, et al. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, 2018, ACL.

[20] Judith L. Klavans. Computational Challenges for Polysynthetic Languages, 2018.

[21] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[22] Greg Durrett, et al. Byte Pair Encoding is Suboptimal for Language Model Pretraining, 2020, FINDINGS.

[23] Mathias Creutz, et al. Unsupervised models for morpheme segmentation and morphology learning, 2007, TSLP.

[24] Ryan Cotterell, et al. What Kind of Language Is Hard to Language-Model?, 2019, ACL.

[25] Ulrich Heid, et al. SMOR: A German Computational Morphology Covering Derivation, Composition and Inflection, 2004, LREC.

[26] Krister Lindén. Helsinki Finite-State Technology, 2014.

[27] Francis M. Tyers, et al. Dependency annotation of noun incorporation in polysynthetic languages, 2020, UDW.

[28] Michael A. Covington, et al. Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR), 2010, J. Quant. Linguistics.

[29] Flor Cagniy Cárdenas Mariño, et al. Analizador morfológico de la lengua quechua basado en software libre Helsinki Finite-State Transducer (HFST), 2013.

[30] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[31] Katharina Kann, et al. Lost in Translation: Analysis of Information Loss During Machine Translation Between Polysynthetic and Fusional Languages, 2018, ArXiv.

[32] Thomas Mayer, et al. Creating a massively parallel Bible corpus, 2014, LREC.

[33] Maciej Tomczak, et al. The need to report effect size estimates revisited. An overview of some recommended measures of effect size, 2014.

[34] Adam Lopez, et al. From Characters to Words to in Between: Do We Capture Morphology?, 2017, ACL.

[35] Anna Korhonen, et al. On the Relation between Linguistic Typology and (Limitations of) Multilingual Language Modeling, 2018, EMNLP.

[36] Philipp Koehn, et al. Europarl: A Parallel Corpus for Statistical Machine Translation, 2005, MTSUMMIT.

[37] Y. Benjamini, et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing, 1995.