Enriching Biomedical Knowledge for Low-resource Language Through Large-scale Translation

Biomedical data and benchmarks are highly valuable yet remain scarce in low-resource languages such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to produce both pretraining and supervised data in the biomedical domain. Thanks to this large-scale translation, we introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates state-of-the-art results on two biomedical benchmarks: summarization and acronym disambiguation. Further, we release ViMedNLI, a new Vietnamese NLP benchmark translated from MedNLI using the recently released English-Vietnamese translation model and carefully refined by human experts, and we evaluate existing methods against ViPubmedT5 on it.
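To make the translation pipeline concrete, below is a minimal sketch of the step that turns English PubMed abstracts into Vietnamese data. It assumes the publicly released MTet checkpoint VietAI/envit5-translation on Hugging Face and its "en: "/"vi: " language-prefix convention; both are assumptions, and the paper's exact translation setup may differ.

```python
# Minimal sketch: translating English PubMed abstracts into Vietnamese
# with a public En-Vi translation model.
# Assumption: the MTet checkpoint "VietAI/envit5-translation" and its
# "en: "/"vi: " prefix convention; the paper's exact setup may differ.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "VietAI/envit5-translation"  # assumed public En-Vi checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def translate_en_to_vi(abstracts: list[str]) -> list[str]:
    """Translate a batch of English abstracts into Vietnamese."""
    inputs = tokenizer(
        [f"en: {text}" for text in abstracts],  # source-language prefix
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
    )
    outputs = model.generate(**inputs, max_length=512, num_beams=4)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # The model emits a "vi: " target prefix; strip it if present.
    return [text.removeprefix("vi: ") for text in decoded]

print(translate_en_to_vi(["The patient was diagnosed with type 2 diabetes."]))
```

At the scale reported in the paper (20 million abstracts), the same per-batch logic would simply be run over large GPU batches; the sketch shows only a single batch.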
