Privacy-Preserving Deep Learning for the Detection of Protected Health Information in Real-World Data: Comparative Evaluation

Background: Collaborative privacy-preserving training methods allow locally stored private data sets to be integrated into machine learning approaches while ensuring confidentiality and nondisclosure.

Objective: In this work, we assess the performance of a state-of-the-art neural network approach for the detection of protected health information in texts when it is trained in a collaborative privacy-preserving way.

Methods: The training adopts distributed selective stochastic gradient descent (ie, it works by exchanging local learning results achieved on private data sets). Five networks were trained on separate real-world clinical data sets by using the privacy-protecting protocol. In total, the data sets contain 1304 real longitudinal patient records for 296 patients.

Results: These networks reached a mean F1 value of 0.955. By comparison, the gold standard centralized training, which is based on the union of all data sets and does not take data security into consideration, reached a final F1 value of 0.962.

Conclusions: Using real-world clinical data, our study shows that the detection of protected health information can be secured by collaborative privacy-preserving training. In general, the approach shows the feasibility of deep learning on distributed and confidential clinical data while ensuring data protection.
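The idea behind distributed selective stochastic gradient descent can be illustrated with a minimal sketch. It is not the network or the protocol configuration used in the study: the linear model, the function names (`local_gradient`, `select_largest`, `train_dssgd`), the round-robin schedule, and all hyperparameters are assumptions chosen for brevity. Each participant computes an update on its own private data and shares only the largest-magnitude fraction of that update with a shared parameter store, so raw records never leave the participant.

```python
def local_gradient(params, data):
    """Mean-squared-error gradient of a linear model on one participant's private data."""
    grad = [0.0] * len(params)
    for x, y in data:
        err = sum(p * xi for p, xi in zip(params, x)) - y
        for j, xi in enumerate(x):
            grad[j] += 2.0 * err * xi / len(data)
    return grad


def select_largest(update, fraction):
    """Keep the given fraction of entries with the largest magnitude as {index: value}."""
    k = max(1, int(fraction * len(update)))
    top = sorted(range(len(update)), key=lambda j: abs(update[j]), reverse=True)[:k]
    return {j: update[j] for j in top}


def train_dssgd(datasets, dim, rounds=300, lr=0.1, fraction=0.5):
    """Round-robin training in which only selected parameter updates are shared."""
    global_params = [0.0] * dim  # held by a hypothetical parameter server
    for _ in range(rounds):
        for data in datasets:  # each entry is one participant's private data set
            # Simplification: the participant downloads all current parameters.
            local = list(global_params)
            update = [-lr * g for g in local_gradient(local, data)]
            # Only the selected entries of the update are uploaded, never the data.
            for j, delta in select_largest(update, fraction).items():
                global_params[j] += delta
    return global_params
```

For example, calling `train_dssgd(datasets, dim=4)` with five lists of `(features, target)` pairs simulates five participants. The `fraction` parameter controls how much of each local update is disclosed: smaller values share less information per round, at the cost of slower convergence.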
