Dict-BERT: Enhancing Language Model Pre-training with Dictionary

Pre-trained language models (PLMs) aim to learn universal language representations by conducting self-supervised training tasks on large-scale corpora. Since PLMs capture word semantics in different contexts, the quality of word representations depends heavily on word frequency, which usually follows a heavy-tailed distribution in the pre-training corpus. As a result, the embeddings of rare words in the tail are usually poorly optimized. In this work, we focus on enhancing language model pre-training by leveraging definitions of rare words in dictionaries (e.g., Wiktionary). To incorporate rare word definitions as part of the input, we fetch them from the dictionary and append them to the end of the input text sequence. In addition to training with the masked language modeling objective, we propose two novel self-supervised pre-training tasks on word- and sentence-level alignment between the input text sequence and rare word definitions to enhance language representation learning with the dictionary. We evaluate the proposed Dict-BERT model on the language understanding benchmark GLUE and eight specialized domain benchmark datasets. Extensive experiments demonstrate that Dict-BERT can significantly improve the understanding of rare words and boost model performance on various NLP downstream tasks.
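
As a rough illustration of the input construction described in the abstract, the sketch below (not the authors' implementation) appends the Wiktionary definition of each rare word to the end of the tokenized input sequence, separated by [SEP]. The `build_dict_bert_input` helper, the `definitions` mapping, and the `rare_vocab` set are hypothetical names introduced for illustration; how rare words are detected (e.g., a corpus-frequency threshold) and how definitions are retrieved are assumptions here.

```python
# Minimal sketch: append dictionary definitions of rare words to the input,
# assuming a pre-built mapping from rare words to Wiktionary glosses.
from typing import Dict, List, Set

def build_dict_bert_input(
    tokens: List[str],
    definitions: Dict[str, str],   # rare word -> Wiktionary gloss (assumed available)
    rare_vocab: Set[str],          # words below some corpus-frequency threshold
    cls_token: str = "[CLS]",
    sep_token: str = "[SEP]",
) -> List[str]:
    """Return the input sequence followed by the definitions of its rare words."""
    sequence = [cls_token] + tokens + [sep_token]
    for word in tokens:
        if word in rare_vocab and word in definitions:
            # Each definition is delimited so the model can align it with
            # the rare word occurring in the original sentence.
            sequence += definitions[word].split() + [sep_token]
    return sequence

# Usage example with hypothetical data:
tokens = "the chemist performed a titration in the lab".split()
definitions = {"titration": "a technique to determine the concentration of a solution"}
rare_vocab = {"titration"}
print(build_dict_bert_input(tokens, definitions, rare_vocab))
```

The word- and sentence-level alignment objectives mentioned in the abstract would then operate on the representations of the original tokens and their appended definitions; their exact formulation is given in the paper and is not reproduced here.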
