Semi-Supervised Information Extraction for Cancer Pathology Reports

Pathology reports are a primary source of data for cancer surveillance programs. Manual coding of pathology reports is labor-intensive, yet it is necessary for obtaining the labeled data needed to train automated information extraction systems. In this study, we investigated semi-supervised deep learning as a way to improve the performance of a multitask information extraction system for automated annotation of pathology reports. We used a set of over 374,000 pathology reports from the Louisiana Tumor Registry and a novel convolutional attention-based auto-encoder. We performed a set of experiments comparing supervised training augmented with unlabeled data at 1%, 5%, 10%, and 50% of the original data size. We also compared the impact of extending text processing to include unlabeled tokens. We found that semi-supervised training consistently improved individual performance, increasing micro-averaged F-scores by between 0.012 and 0.064 and macro-averaged F-scores by up to 0.158. This demonstrates that semantic information learned via unsupervised learning can be used to improve supervised clinical task performance.
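The results above are reported as micro- and macro-averaged F-scores. As a minimal sketch of how the two averages differ (not the evaluation code used in the study), the following computes both for single-label predictions. The function name `f_scores` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def f_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-class true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F-score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute one F-score per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy, made-up labels: a frequent class and a rare class.
y_true = ["breast"] * 8 + ["lung"] * 2
y_pred = ["breast"] * 8 + ["breast", "lung"]
micro, macro = f_scores(y_true, y_pred)
print(f"micro-F = {micro:.3f}, macro-F = {macro:.3f}")
```

Because the macro average weights every class equally, errors on rare classes move it more than they move the micro average, which is one reason the two scores can change by different amounts.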
