Language Scaling: Applications, Challenges and Approaches

Language scaling aims to deploy Natural Language Processing (NLP) applications economically across many countries and regions with different languages. Industry has invested heavily in language scaling, since many parties want to bring their applications and services to global markets. At the same time, scaling out NLP applications to various languages, which is essentially a data science problem, remains a grand challenge due to the huge differences in morphology, syntax, and pragmatics among languages. We present a comprehensive survey and tutorial on language scaling. We start with a clear problem description of language scaling and an intuitive discussion of the overall challenges. We then outline two major categories of approaches to language scaling, namely model transfer and data transfer, and present a taxonomy that summarizes the various methods in the literature. A large part of the tutorial is organized around different types of NLP applications. Finally, we discuss several important open challenges in this area and future directions.
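To make the model-transfer idea concrete, here is a minimal toy sketch in Python. It assumes a shared cross-lingual embedding space in which translation pairs map to the same vector (in practice this role is played by a multilingual encoder such as mBERT or XLM-R); the tiny embedding table, the nearest-centroid classifier, and all example sentences are invented for illustration only, not taken from the survey.

```python
# Toy "model transfer": train a classifier on English labeled data only,
# then apply it zero-shot to another language, relying on a shared
# cross-lingual embedding space.

# Hypothetical shared space: English/Spanish translation pairs share a vector,
# mimicking what a multilingual encoder provides.
EMB = {
    "good": (1.0, 0.0), "bueno": (1.0, 0.0),
    "bad":  (0.0, 1.0), "malo":  (0.0, 1.0),
    "book": (0.5, 0.5), "libro": (0.5, 0.5),
}

def embed(sentence):
    """Represent a sentence as the average of its word vectors."""
    vecs = [EMB[w] for w in sentence.split() if w in EMB]
    n = len(vecs) or 1
    return tuple(sum(v[i] for v in vecs) / n for i in range(2))

def train_centroids(examples):
    """A trivial nearest-centroid 'model': one mean vector per label."""
    centroids = {}
    for label in {y for _, y in examples}:
        vecs = [embed(x) for x, y in examples if y == label]
        centroids[label] = tuple(sum(v[i] for v in vecs) / len(vecs)
                                 for i in range(2))
    return centroids

def predict(centroids, sentence):
    """Assign the label whose centroid is closest in the shared space."""
    v = embed(sentence)
    return min(centroids,
               key=lambda c: sum((v[i] - centroids[c][i]) ** 2 for i in range(2)))

# Train on English only...
model = train_centroids([("good book", "pos"), ("bad book", "neg")])
# ...and predict zero-shot on Spanish: the shared space carries the signal.
print(predict(model, "libro bueno"))  # -> pos
```

Data transfer would instead translate the labeled data itself (e.g., machine-translate the English examples into Spanish and train a Spanish model), trading translation noise for a monolingual training setup.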
