Paper information - Are All Languages Equally Hard to Language-Model?

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.

Ryan Cotterell | Brian Roark | Jason Eisner | Sebastian J. Mielke
