BERTAC: Enhancing Transformer-based Language Models with Adversarially Pretrained Convolutional Neural Networks

Transformer-based language models (TLMs), such as BERT, ALBERT, and GPT-3, have shown strong performance on a wide range of NLP tasks and currently dominate the field. However, many researchers wonder whether these models can maintain their dominance forever. We do not have an answer yet, but, as an attempt to find better neural architectures and training schemes, we pretrain a simple CNN using a GAN-style learning scheme and Wikipedia data, and then integrate it with standard TLMs. We show that on the GLUE tasks, the combination of our pretrained CNN with ALBERT outperforms the original ALBERT and achieves performance similar to that of the state of the art (SOTA). Furthermore, on open-domain QA (Quasar-T and SearchQA), the combination of the CNN with ALBERT or RoBERTa achieves stronger performance than both the SOTA systems and the original TLMs. We hope that this work provides a hint toward developing novel, strong network architectures along with their training schemes. Our source code and models are available at https://github.com/nict-wisdom/bertac.
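Since the abstract only sketches the approach, the PyTorch sketch below illustrates the two ingredients it mentions: a simple text CNN pretrained with a GAN-style objective, and one straightforward way to fuse the resulting CNN features with a TLM's pooled output during fine-tuning. This is our own illustration under stated assumptions, not the released BERTAC implementation; all module names, dimensions, and the choice of "real" representations fed to the discriminator are hypothetical. See the repository linked above for the authors' actual code.

```python
# Minimal sketch (assumptions, not the released BERTAC code): a simple 1-D text CNN
# pretrained with a GAN-style objective, then fused with a TLM's pooled output.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Simple CNN sentence encoder over pretrained word embeddings."""
    def __init__(self, emb_dim=300, hidden=256, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, hidden, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, emb):                      # emb: (batch, seq_len, emb_dim)
        x = emb.transpose(1, 2)                  # -> (batch, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)          # (batch, hidden * len(kernel_sizes))

class Discriminator(nn.Module):
    """Scores whether a representation comes from the 'real' source or the CNN."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, rep):
        return self.net(rep)                     # raw logit

bce = nn.BCEWithLogitsLoss()

def gan_step(cnn, disc, emb, real_rep, opt_g, opt_d):
    """One GAN-style update. `real_rep` stands in for whatever target
    representations the pretraining scheme treats as 'real' (hypothetical here)."""
    fake_rep = cnn(emb)
    ones = torch.ones(fake_rep.size(0), 1, device=fake_rep.device)
    zeros = torch.zeros(fake_rep.size(0), 1, device=fake_rep.device)
    # 1) discriminator learns to separate real from CNN-generated representations
    d_loss = bce(disc(real_rep.detach()), ones) + bce(disc(fake_rep.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) CNN is updated to fool the discriminator
    g_loss = bce(disc(fake_rep), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

class FusedClassifier(nn.Module):
    """Fine-tuning head: concatenate the TLM's pooled output with the CNN feature.
    Assumes a HuggingFace-style TLM whose forward pass returns `pooler_output`."""
    def __init__(self, tlm, cnn, tlm_dim, cnn_dim, num_labels):
        super().__init__()
        self.tlm, self.cnn = tlm, cnn
        self.classifier = nn.Linear(tlm_dim + cnn_dim, num_labels)

    def forward(self, input_ids, attention_mask, word_emb):
        pooled = self.tlm(input_ids, attention_mask=attention_mask).pooler_output
        cnn_feat = self.cnn(word_emb)            # e.g. fastText embeddings of the same text
        return self.classifier(torch.cat([pooled, cnn_feat], dim=1))
```

The actual BERTAC pretraining objective and TLM integration may differ from this sketch; the point here is only the overall shape of GAN-style CNN pretraining followed by feature fusion with a standard TLM.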

[1] Ion Androutsopoulos et al. Domain Adversarial Fine-Tuning as an Effective Regularizer, 2020, Findings.

[2] Yu Cheng et al. Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding, 2020, arXiv.

[3] Deniz Yuret et al. KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media, 2020, SemEval.

[4] Jianfeng Gao et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention, 2020, ICLR.

[5] Mark Chen et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[6] Haoran Huang et al. Spelling Error Correction with Soft-Masked BERT, 2020, ACL.

[7] J. Nie et al. VGCN-BERT: Augmenting BERT with Graph Embedding for Text Classification, 2020, ECIR.

[8] Xuanjing Huang et al. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters, 2020, Findings.

[9] Wenhan Xiong et al. Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model, 2019, ICLR.

[10] Xiaola Lin et al. Aggregating Bidirectional Encoder Representations Using MatchLSTM for Sequence Matching, 2019, EMNLP.

[11] Peter J. Liu et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[12] Kevin Gimpel et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.

[13] Zhe Zhao et al. K-BERT: Enabling Language Representation with Knowledge Graph, 2019, AAAI.

[14] Noah A. Smith et al. Knowledge Enhanced Contextual Word Representations, 2019, EMNLP.

[15] Ramesh Nallapati et al. Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering, 2019, EMNLP.

[16] Hao Tian et al. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, 2019, AAAI.

[17] Omer Levy et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.

[18] An Yang et al. Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension, 2019, ACL.

[19] Jong-Hoon Oh et al. Open-Domain Why-Question Answering with Adversarial Learning to Encode Answer Texts, 2019, ACL.

[20] Yiming Yang et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[21] Maosong Sun et al. ERNIE: Enhanced Language Representation with Informative Entities, 2019, ACL.

[22] Siddhartha Brahma et al. Unsupervised Learning of Sentence Representations Using Sequence Consistency, 2018, arXiv.

[23] Zhiyuan Liu et al. Denoising Distantly Supervised Open-Domain Question Answering, 2018, ACL.

[24] Samuel R. Bowman et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[25] William W. Cohen et al. Quasar: Datasets for Question Answering by Search and Reading, 2017, arXiv.

[26] Lukasz Kaiser et al. Attention is All you Need, 2017, NIPS.

[27] Kyunghyun Cho et al. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine, 2017, arXiv.

[28] Jason Weston et al. Reading Wikipedia to Answer Open-Domain Questions, 2017, ACL.

[29] Tomas Mikolov et al. Enriching Word Vectors with Subword Information, 2016, TACL.

[30] Jürgen Schmidhuber et al. Training Very Deep Networks, 2015, NIPS.

[31] François Laviolette et al. Domain-Adversarial Training of Neural Networks, 2015, J. Mach. Learn. Res.

[32] Jian Sun et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, ICCV.

[33] Yoshua Bengio et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.

[34] Ion S. Dunn et al. Exploring the Limits, 2009.

[35] Joshua B. Tenenbaum et al. Modelling Relational Data using Bayesian Clustered Tensor Factorization, 2009, NIPS.

[36] S. Hochreiter et al. Long Short-Term Memory, 1997, Neural Computation.

[37] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[38] Ian J. Goodfellow et al. Generative Adversarial Nets, 2014, NIPS.

[39] J. Andrade-Cetto et al. Object Recognition, 2014, Computer Vision: A Reference Guide.

[40] Yoshua Bengio et al. Object Recognition with Gradient-Based Learning, 1999, Shape, Contour and Grouping in Computer Vision.