Paper information - Semi-Supervised Information Extraction for Cancer Pathology Reports

Pathology reports are a main source of data for cancer surveillance programs. Manual coding of pathology reports is labor-intensive but necessary for obtaining the labeled data used to train automated information extraction systems. In this study, we investigated whether semi-supervised deep learning can improve the performance of a multitask information extraction system for automated annotation of pathology reports. We used a set of over 374,000 pathology reports from the Louisiana Tumor Registry and a novel convolutional attention-based auto-encoder. We performed a set of experiments comparing supervised training augmented with unlabeled data at 1%, 5%, 10%, and 50% of the original data size. We also compared the impact of extending text processing to include unlabeled tokens. We found that semi-supervised training consistently improved individual performance, increasing micro-averaged F-scores by between 0.012 and 0.064 and macro-averaged F-scores by up to 0.158. This demonstrates that semantic information learned via unsupervised learning can be used to improve supervised clinical task performance.
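The abstract reports both micro- and macro-averaged F-scores, and the two can diverge sharply when classes are imbalanced. The following is a minimal sketch of how the two averages are commonly computed for single-label classification. It is not the paper's evaluation code: the function name `f_scores` and the toy labels are made up for illustration.

```python
from collections import Counter


def f_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical, imbalanced toy labels (not data from the paper).
y_true = ["lung"] * 8 + ["breast"] * 2
y_pred = ["lung"] * 8 + ["lung", "breast"]
micro, macro = f_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Because the macro average weights every class equally, an error on a rare class moves it much more than the micro average, which may be one reason a method can show a larger macro-averaged gain than micro-averaged gain.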
