Privacy-Preserving Deep Learning NLP Models for Cancer Registries

Population cancer registries can benefit from Deep Learning (DL) to automatically extract cancer characteristics from the high volume of unstructured pathology text reports they process annually. The success of DL in tackling this and other real-world problems depends on the availability of large labeled datasets for model training. Although collaboration among cancer registries is essential to fully exploit the promise of DL, privacy and confidentiality concerns are the main obstacles to data sharing across registries. Moreover, DL for natural language processing (NLP) requires sharing a vocabulary dictionary for the embedding layer, which may contain patient identifiers. Thus, even distributing trained models across cancer registries can violate privacy. In this article, we propose distributing DL NLP models via privacy-preserving transfer learning approaches that do not share sensitive data. These approaches are used to distribute a multitask convolutional neural network (MT-CNN) NLP model among cancer registries. The model is trained to extract six key cancer characteristics (tumor site, subsite, laterality, behavior, histology, and grade) from cancer pathology reports. Using 410,064 pathology documents from two cancer registries, we compare the proposed approaches with conventional transfer learning without privacy preservation, single-registry models, and a model trained on centrally hosted data. The results show that transfer learning approaches, including data sharing and model distribution, significantly outperform the single-registry model. In addition, the best-performing privacy-preserving model distribution approach achieves average micro- and macro-F1 scores across all extraction tasks (0.823, 0.580) that are statistically indistinguishable from those of the centralized model (0.827, 0.585).
