NEZHA: Neural Contextualized Representation for Chinese Language Understanding

Pre-trained language models have achieved great success in various natural language understanding (NLU) tasks, owing to their capacity to capture deep contextualized information in text by pre-training on large-scale corpora. In this technical report, we present our practice of pre-training language models named NEZHA (NEural contextualiZed representation for CHinese lAnguage understanding) on Chinese corpora and fine-tuning them for Chinese NLU tasks. The current version of NEZHA is based on BERT with a collection of proven improvements, including Functional Relative Positional Encoding as an effective positional encoding scheme, the Whole Word Masking strategy, Mixed Precision Training, and the LAMB optimizer. The experimental results show that NEZHA achieves state-of-the-art performance when fine-tuned on several representative Chinese tasks, including named entity recognition (People's Daily NER), sentence matching (LCQMC), Chinese sentiment classification (ChnSenti), and natural language inference (XNLI).

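To make the positional-encoding scheme concrete, the following minimal NumPy sketch illustrates one way functional (i.e., fixed sinusoidal rather than learned) relative positional encodings can enter self-attention, following the relative-attention formulation of Shaw et al. (2018) that the report builds on. The function names, the single-head setup, and the exact frequency schedule are illustrative assumptions, not NEZHA's released implementation.

```python
import numpy as np

def functional_relative_encodings(seq_len, head_dim):
    """Fixed sinusoidal encodings of the relative distance j - i.

    Returns an array of shape (seq_len, seq_len, head_dim) whose entry
    [i, j] encodes the offset j - i with standard Transformer-style
    sinusoid frequencies (an assumed, non-learned "functional" scheme).
    """
    rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]   # (L, L)
    inv_freq = 1.0 / (10000 ** (np.arange(0, head_dim, 2) / head_dim))
    angles = rel[..., None] * inv_freq                                # (L, L, head_dim // 2)
    enc = np.zeros((seq_len, seq_len, head_dim))
    enc[..., 0::2] = np.sin(angles)
    enc[..., 1::2] = np.cos(angles)
    return enc

def relative_self_attention(q, k, v):
    """Single-head self-attention with relative-position terms.

    The relative encodings are added both to the attention logits and to
    the value aggregation (Shaw et al. style), but computed by the fixed
    sinusoidal function above instead of learned embeddings.
    q, k, v: (seq_len, head_dim) arrays for one attention head.
    """
    seq_len, head_dim = q.shape
    a = functional_relative_encodings(seq_len, head_dim)
    # Logits: content-content term plus content-position term.
    logits = q @ k.T + np.einsum("id,ijd->ij", q, a)
    logits /= np.sqrt(head_dim)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: weighted values plus weighted relative encodings.
    return weights @ v + np.einsum("ij,ijd->id", weights, a)

# Tiny usage example with one head of dimension 64 over 8 tokens.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
out = relative_self_attention(x, x, x)
print(out.shape)   # (8, 64)
```

Because the encodings are a deterministic function of the offset j - i, they add no trainable parameters and generalize to sequence lengths not seen during pre-training, which is the stated motivation for a functional rather than learned scheme.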