Extracting comprehensive clinical information for breast cancer using deep learning methods

OBJECTIVE Breast cancer is the most common malignant tumor among women. Diagnosis and treatment information for breast cancer patients is spread across multiple types of clinical fields, including clinicopathological data, genotype and phenotype information, treatment information, and prognosis information. However, current studies mainly focus on extracting information from a single type of clinical field. This study defines a comprehensive information model to represent the whole-course clinical information of patients. It then uses deep learning approaches to extract concepts and their attributes from clinical breast cancer documents by fine-tuning pretrained Bidirectional Encoder Representations from Transformers (BERT) language models.

MATERIALS AND METHODS The clinical corpus used in this study came from one 3A cancer hospital in China and consisted of the encounter notes, operation records, pathology notes, radiology notes, progress notes, and discharge summaries of 100 breast cancer patients. Our system consists of two components: a named entity recognition (NER) component and a relation recognition component. For each component, we implemented deep learning-based approaches by fine-tuning BERT, which has outperformed other state-of-the-art methods on multiple natural language processing (NLP) tasks. A clinical language model was first pretrained using BERT on a large-scale unlabeled corpus of Chinese clinical text. For NER, the contextual embeddings pretrained using BERT were used as the input features of a Bi-LSTM-CRF (bidirectional long short-term memory-conditional random field) model and were fine-tuned on the annotated breast cancer notes. For relation extraction, we proposed an approach to fine-tune BERT that treats the task as a classification problem in which the two entities mentioned in the input sentence are replaced with their semantic types.

RESULTS Our best-performing system achieved F1 scores of 93.53% for NER and 96.73% for relation extraction. Additional evaluations showed that the deep learning-based approaches that fine-tuned BERT outperformed the traditional Bi-LSTM-CRF and CRF machine learning algorithms in NER, and the attention-Bi-LSTM and SVM (support vector machine) algorithms in relation recognition.

CONCLUSION In this study, we developed a deep learning approach that fine-tunes BERT to extract breast cancer concepts and their attributes. It outperformed traditional machine learning algorithms, which supports its use in broader NER and relation extraction tasks in the medical domain.
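
The input preparation described for relation extraction, in which the two entity mentions in a sentence are replaced with their semantic types before classification, might look roughly like the sketch below. It is only an illustration under stated assumptions: the `Entity` class, the `mask_entity_pair` helper, the bracketed placeholder format, the `TUMOR` and `SIZE` labels, and the English example sentence are all hypothetical and are not taken from the study, whose corpus is Chinese clinical text. The fine-tuned BERT classifier that would consume the masked sentence is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int      # character offset where the mention begins (inclusive)
    end: int        # character offset where the mention ends (exclusive)
    sem_type: str   # semantic type label (hypothetical label set)


def mask_entity_pair(sentence: str, first: Entity, second: Entity) -> str:
    """Replace two non-overlapping entity mentions with their semantic types.

    The "[TYPE]" placeholder format is an assumption made for this sketch.
    """
    # Order the pair by position so the slices below line up correctly.
    left, right = sorted((first, second), key=lambda e: e.start)
    if left.end > right.start:
        raise ValueError("entity spans must not overlap")
    return (
        sentence[:left.start]
        + f"[{left.sem_type}]"
        + sentence[left.end:right.start]
        + f"[{right.sem_type}]"
        + sentence[right.end:]
    )


# Hypothetical example sentence with two annotated mentions.
sentence = "A mass measuring 2.5 cm was found in the left breast."
mass = Entity(start=2, end=6, sem_type="TUMOR")    # "mass"
size = Entity(start=17, end=23, sem_type="SIZE")   # "2.5 cm"

masked = mask_entity_pair(sentence, mass, size)
print(masked)
```

Each candidate entity pair in a sentence would yield one masked string of this kind, and the classifier would then assign it a relation label or a no-relation label. Replacing mentions with their types presumably lets the model generalize over surface forms such as specific sizes or drug names.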
