Enriching Biomedical Knowledge for Low-resource Language Through Large-scale Translation

Biomedical data and benchmarks are highly valuable yet remain scarce in low-resource languages such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to produce both pretraining and supervised data in the biomedical domain. Thanks to this large-scale translation, we introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates state-of-the-art results on two biomedical benchmarks: summarization and acronym disambiguation. Further, we release ViMedNLI, a new Vietnamese NLP benchmark translated from MedNLI using the recently released English-Vietnamese translation model and carefully refined by human experts, and we evaluate existing methods against ViPubmedT5 on it.
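To make the translation pipeline concrete, below is a minimal sketch of the step that turns English PubMed abstracts into Vietnamese data. It assumes the publicly released MTet checkpoint VietAI/envit5-translation on Hugging Face and its "en: "/"vi: " language-prefix convention; both are assumptions, and the paper's exact translation setup may differ.

```python
# Minimal sketch: translating English PubMed abstracts into Vietnamese
# with a public En-Vi translation model.
# Assumption: the MTet checkpoint "VietAI/envit5-translation" and its
# "en: "/"vi: " prefix convention; the paper's exact setup may differ.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "VietAI/envit5-translation"  # assumed public En-Vi checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def translate_en_to_vi(abstracts: list[str]) -> list[str]:
    """Translate a batch of English abstracts into Vietnamese."""
    inputs = tokenizer(
        [f"en: {text}" for text in abstracts],  # source-language prefix
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
    )
    outputs = model.generate(**inputs, max_length=512, num_beams=4)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # The model emits a "vi: " target prefix; strip it if present.
    return [text.removeprefix("vi: ") for text in decoded]

print(translate_en_to_vi(["The patient was diagnosed with type 2 diabetes."]))
```

At the scale reported in the paper (20 million abstracts), the same per-batch logic would simply be run over large GPU batches; the sketch shows only a single batch.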
