ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

Recent pretraining models in Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntactic and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining. The glyph embedding is obtained from several fonts of a Chinese character and captures character semantics from visual features; the pinyin embedding characterizes the pronunciation of a character and handles the highly prevalent heteronym phenomenon in Chinese, where the same character has different pronunciations with different meanings. Pretrained on a large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields significant performance gains over baseline models with fewer training steps. It achieves new state-of-the-art results on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, and sentence-pair matching, and delivers competitive performance on named entity recognition and Chinese word segmentation.
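
To make the fusion of character, glyph, and pinyin information concrete, the following is a minimal PyTorch-style sketch of one plausible way to combine the three embeddings into a single per-character representation before feeding it to the Transformer encoder. All names, dimensions, and the choice of a linear fusion layer and mean-pooling over pinyin symbols are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of char/glyph/pinyin embedding fusion (assumed names and shapes).
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    def __init__(self, vocab_size, pinyin_vocab_size, hidden_size=768, glyph_dim=1728):
        super().__init__()
        # Standard learnable character-id embedding.
        self.char_embedding = nn.Embedding(vocab_size, hidden_size)
        # Glyph embedding: flattened font-image features projected to the hidden size.
        self.glyph_proj = nn.Linear(glyph_dim, hidden_size)
        # Pinyin embedding: embed romanized pronunciation symbols per character.
        self.pinyin_embedding = nn.Embedding(pinyin_vocab_size, hidden_size)
        # Fuse the three views back into a single hidden vector per character.
        self.fusion = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, char_ids, glyph_feats, pinyin_ids):
        # char_ids:    (batch, seq_len)              character ids
        # glyph_feats: (batch, seq_len, glyph_dim)   pixel features from character fonts
        # pinyin_ids:  (batch, seq_len, pinyin_len)  pronunciation symbol ids
        char_emb = self.char_embedding(char_ids)
        glyph_emb = self.glyph_proj(glyph_feats)
        # Pool over the pinyin symbol sequence of each character (assumption: mean pooling).
        pinyin_emb = self.pinyin_embedding(pinyin_ids).mean(dim=2)
        fused = torch.cat([char_emb, glyph_emb, pinyin_emb], dim=-1)
        return self.fusion(fused)
```

Under these assumptions, the fused output has the same shape as a standard BERT token embedding, so position embeddings and the rest of the encoder can be reused unchanged.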
