Selective Information Extraction Strategies for Cancer Pathology Reports with Convolutional Neural Networks

For model predictions to be trustworthy, new data scored by the model must come from the same population as the data used for training. When a model scores data that differs from its training data, neither its predictions nor its performance metrics can be trusted. Identifying and excluding these anomalous data points is therefore an important task when using models in the real world. Traditional machine learning algorithms and classifiers cannot abstain in this case. Here we propose a data-novelty detection algorithm for Convolutional Neural Network classifiers that yields a rejection score for each new data point scored. It is a post-modeling procedure that examines the distribution of convolution filters to determine whether a prediction should be trusted. We apply this algorithm to an information extraction model for a natural language text corpus, and we evaluate its performance using a primary cancer site classification model applied to cancer pathology reports. Results demonstrate that the algorithm is an effective way to exclude cancer pathology reports from model scoring when they do not contain the information needed to accurately classify the primary cancer type.
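The abstract does not specify how the rejection score is computed, so the following is only an illustrative sketch of the general idea and not the authors' algorithm. It assumes that each document is summarized by a vector of max-pooled convolution filter activations, that novelty is measured as the mean absolute z-score of those activations against training-set statistics, and that the abstention threshold is a percentile of the training scores. The function names (fit_filter_stats, rejection_score, should_abstain) and the synthetic data are hypothetical.

```python
import numpy as np


def fit_filter_stats(train_activations):
    """Per-filter mean and std of pooled activations over the training set.

    train_activations: array of shape (n_documents, n_filters).
    """
    acts = np.asarray(train_activations, dtype=float)
    mean = acts.mean(axis=0)
    std = acts.std(axis=0) + 1e-8  # avoid division by zero for dead filters
    return mean, std


def rejection_score(activations, mean, std):
    """Mean absolute z-score across filters; larger means more novel.

    This scoring rule is an assumption made for illustration.
    """
    z = np.abs((np.asarray(activations, dtype=float) - mean) / std)
    return z.mean(axis=-1)


def should_abstain(activations, mean, std, threshold):
    """True when the score exceeds the threshold, i.e. do not trust the prediction."""
    return rejection_score(activations, mean, std) > threshold


# Synthetic stand-in for filter activations (not real pathology-report data).
rng = np.random.default_rng(0)
train = rng.normal(loc=1.0, scale=0.2, size=(500, 16))
mean, std = fit_filter_stats(train)

# Assumed policy: abstain above the 99th percentile of training scores.
threshold = np.percentile(rejection_score(train, mean, std), 99)

in_distribution = rng.normal(loc=1.0, scale=0.2, size=16)
novel = np.zeros(16)  # a document on which no filter fires

in_score = rejection_score(in_distribution, mean, std)
novel_score = rejection_score(novel, mean, std)
novel_rejected = should_abstain(novel, mean, std, threshold)
```

In this sketch a report whose filter activations sit far from the training distribution receives a high score and is withheld from classification, which mirrors the abstention behavior the abstract describes; the actual statistic and threshold used in the paper may differ.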
