AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing

Transformer-based pretrained language models (T-PTLMs) have achieved great success in almost every NLP task. The evolution of these models started with GPT and BERT. T-PTLMs are built on top of the transformer architecture, self-supervised learning, and transfer learning: they learn universal language representations from large volumes of text data using self-supervised learning and then transfer this knowledge to downstream tasks. Because these models provide strong background knowledge, downstream models need not be trained from scratch. In this comprehensive survey, we first give a brief overview of self-supervised learning. Next, we explain core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods. We then present a new taxonomy of T-PTLMs and give a brief overview of various benchmarks, both intrinsic and extrinsic. We also summarize useful libraries for working with T-PTLMs. Finally, we highlight some future research directions that could further improve these models. We strongly believe that this survey will serve as a good reference for learning the core concepts and for staying up to date with recent developments in T-PTLMs. The list of T-PTLMs along with links is available at https://mr-nlp.github.io/posts/2021/05/tptlms-list/
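To make the pretrain-then-adapt workflow described above concrete, the sketch below fine-tunes a pretrained transformer on a downstream classification task instead of training a model from scratch. It is a minimal illustration, assuming the Hugging Face `transformers` and `datasets` libraries; the checkpoint (`bert-base-uncased`), the dataset (GLUE SST-2), and the hyperparameters are illustrative choices, not prescriptions from the survey.

```python
# Minimal sketch of downstream adaptation (fine-tuning) of a pretrained transformer.
# Assumes the Hugging Face `transformers` and `datasets` libraries are installed;
# model name, dataset, and hyperparameters are illustrative, not from the survey.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# Load a pretrained encoder and its tokenizer; the classification head is new.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# A downstream task: binary sentiment classification (GLUE SST-2).
dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    # Convert raw sentences into token IDs the pretrained model expects.
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

# Fine-tune: the pretrained weights supply the "background knowledge",
# so only brief task-specific training is needed.
args = TrainingArguments(output_dir="sst2-finetuned",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
```

The same pattern applies to other downstream adaptation settings surveyed in the paper (e.g., swapping the checkpoint for a domain-specific or multilingual T-PTLM, or replacing full fine-tuning with parameter-efficient methods such as adapters or prompt tuning).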
