A clinical-specific BERT developed with a large volume of Japanese clinical narratives

Generalized language models pre-trained on large corpora have achieved great performance on natural language tasks. While many pre-trained transformers for English have been published, few models are available for Japanese text, especially in clinical medicine. In this work, we describe the development of a clinical-specific BERT model trained on a large volume of Japanese clinical narratives and evaluate it on the NTCIR-13 MedWeb task, which consists of pseudo-Twitter messages about medical concerns annotated with eight labels. Approximately 120 million clinical texts stored at the University of Tokyo Hospital were used as the dataset. A BERT-base model was pre-trained on the entire dataset with a vocabulary of 25,000 tokens. Pre-training nearly saturated at about 4 epochs, at which point the accuracies of Masked LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT tended to show higher performance on the MedWeb task than the other, non-domain-specific BERTs; however, no significant differences were found. The advantage of training on domain-specific text may become apparent in more complex tasks on actual clinical text, and such an evaluation corpus has yet to be developed.
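As a concrete illustration of the downstream evaluation, the sketch below loads a pre-trained BERT checkpoint with an eight-way multi-label classification head and predicts MedWeb-style labels for a tweet-like Japanese message, using the Hugging Face Transformers API. This is a minimal sketch under stated assumptions, not the authors' code: the checkpoint path, label names, and decision threshold are illustrative, and the classification head shown here is randomly initialized, so it would still need to be fine-tuned on the MedWeb training data before its predictions are meaningful.

```python
# Minimal sketch: applying a Japanese clinical BERT to the NTCIR-13 MedWeb
# eight-label, multi-label classification task (assumed setup, not the
# authors' released code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical local checkpoint; the paper's model was pre-trained in-house.
CHECKPOINT = "path/to/japanese-clinical-bert"

# Eight MedWeb symptom/condition labels (as defined by the NTCIR-13 MedWeb task).
LABELS = ["influenza", "diarrhea", "hayfever", "cough",
          "headache", "fever", "runnynose", "cold"]

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # uses BCE-with-logits loss
)
# NOTE: the classification head is freshly initialized here; fine-tune on the
# MedWeb training split before using the predictions for evaluation.

def predict(text: str, threshold: float = 0.5) -> list[str]:
    """Return the subset of MedWeb labels whose sigmoid score exceeds the threshold."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

# "I have a fever and my head hurts. It might be influenza."
print(predict("熱があって頭も痛い。インフルエンザかもしれない。"))
```

The multi-label head (independent sigmoid per label) matches the MedWeb setting, where a single message can mention several medical concerns at once; per-label thresholding at 0.5 is a common default rather than a value reported in the paper.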
