Building Chinese Biomedical Language Models via Multi-Level Text Discrimination

Pre-trained language models (PLMs) such as BERT and GPT have revolutionized the field of NLP, not only in the general domain but also in the biomedical domain. Most prior efforts to build biomedical PLMs have relied simply on domain adaptation and have focused mainly on English. In this work, we introduce eHealth, a Chinese biomedical PLM built from scratch with a new pre-training framework. This framework pre-trains eHealth as a discriminator through both token- and sequence-level discrimination. The former detects input tokens corrupted by a generator and recovers their original identities from plausible candidates, while the latter further distinguishes corruptions of the same original sequence from those of other sequences. In this way, eHealth learns language semantics at both the token and sequence levels. Extensive experiments on 11 Chinese biomedical language understanding tasks of various forms verify the effectiveness and superiority of our approach. We release the pre-trained model at \url{https://github.com/PaddlePaddle/Research/tree/master/KG/eHealth} and will also release the code later.
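The two objectives can be pictured with a short, hedged sketch. The PyTorch snippet below is not the released eHealth implementation; the function names, tensor shapes, and the contrastive (InfoNCE-style) formulation of the sequence-level loss are illustrative assumptions based only on the description above.

```python
# Minimal sketch of the two discrimination objectives described in the abstract.
# Not the authors' code: all names, shapes, and the contrastive formulation of
# the sequence-level loss are assumptions made for illustration.
import torch
import torch.nn.functional as F


def token_level_loss(hidden, candidate_embeddings, target_idx):
    """Recover each corrupted token's original identity from a small set of
    plausible candidates proposed by the generator.

    hidden:                (batch, seq_len, dim) discriminator outputs
    candidate_embeddings:  (batch, seq_len, num_cand, dim) embeddings of the
                           candidate tokens at each position
    target_idx:            (batch, seq_len) index of the original token
                           among the candidates
    """
    # Score each candidate by its dot product with the contextual state.
    logits = torch.einsum("bld,blkd->blk", hidden, candidate_embeddings)
    return F.cross_entropy(logits.flatten(0, 1), target_idx.flatten())


def sequence_level_loss(cls_a, cls_b, temperature=0.07):
    """Distinguish corruptions of the same original sequence from corruptions
    of other sequences in the batch, here written as an InfoNCE-style loss.

    cls_a, cls_b: (batch, dim) sequence representations of two corrupted
                  views of each original sequence.
    """
    a = F.normalize(cls_a, dim=-1)
    b = F.normalize(cls_b, dim=-1)
    logits = a @ b.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)     # diagonal pairs are positives
```

In such a setup the discriminator would be trained on the sum of the two losses, so that token-level supervision shapes local representations while the sequence-level term shapes the whole-sentence representation; the exact weighting and pooling used by eHealth are not specified here.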
