Evaluating semantic textual similarity in clinical sentences using deep learning and sentence embeddings

The wide adoption of electronic health records (EHRs) has fostered an improvement in healthcare quality, with EHRs currently representing a major source of medical information. Nevertheless, this process has also brought new challenges to the medical environment since the facilitated replication of information (e.g. using copy-paste) has resulted in less concise and sometimes incorrect information, which hinders the understandability of this data and can compromise the quality of medical decisions drawn from it. Due to the high volume and redundancy in medical data, it is imperative to develop solutions that can condense information whilst retaining its value, with a possible methodology involving the assessment of the semantic similarity between clinical text excerpts. In this paper we present an approach that explores neural networks and different types of text preprocessing pipelines, and that evaluates the impact of using word embeddings or sentence embeddings. We present the results following our participation in the n2c2 shared-task on clinical semantic textual similarity, perform an error analysis and discuss obtained results along with possible future improvements.

[1]  Noémie Elhadad,et al.  Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies , 2013, BMC Bioinformatics.

[2]  Kent A. Spackman,et al.  SNOMED clinical terms: overview of the development process and project status , 2001, AMIA.

[3]  Henrique M. G. Martins,et al.  Using Structured EHR Data and SVM to Support ICD-9-CM Coding , 2013, 2013 IEEE International Conference on Healthcare Informatics.

[4]  Peter Szolovits,et al.  MIMIC-III, a freely accessible critical care database , 2016, Scientific Data.

[5]  Hongfang Liu,et al.  MedSTS: a resource for clinical semantic textual similarity , 2018, Language Resources and Evaluation.

[6]  Peter Basch,et al.  Clinical Documentation in the 21 st Century : Executive Summary of a Policy Position Paper From the American College of Physicians , 2015 .

[7]  Claire Cardie,et al.  SemEval-2014 Task 10: Multilingual Semantic Textual Similarity , 2014, *SEMEVAL.

[8]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[9]  Peter Basch,et al.  Clinical documentation in the 21st century: executive summary of a policy position paper from the American College of Physicians. , 2015, Annals of internal medicine.

[10]  Eneko Agirre,et al.  SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity , 2012, *SEMEVAL.

[11]  Anna Rumshisky,et al.  MCN: A comprehensive corpus for medical concept normalization , 2019, J. Biomed. Informatics.

[12]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[13]  Henrique M. G. Martins,et al.  Preprocessing structured clinical data for predictive modeling and decision support , 2016, Applied Clinical Informatics.

[14]  Noémie Elhadad,et al.  Automated methods for the summarization of electronic health records , 2015, J. Am. Medical Informatics Assoc..

[15]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[16]  Arzucan Özgür,et al.  BIOSSES: a semantic sentence similarity estimation system for the biomedical domain , 2017, Bioinform..

[17]  Matteo Pagliardini,et al.  Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features , 2017, NAACL.

[18]  Yifan Peng,et al.  BioSentVec: creating sentence embeddings for biomedical texts , 2018, 2019 IEEE International Conference on Healthcare Informatics (ICHI).

[19]  Eneko Agirre,et al.  SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation , 2017, *SEMEVAL.

[20]  M. Girolami,et al.  Analysis of free text in electronic health records for identification of cancer patient trajectories , 2017, Scientific Reports.

[21]  Eneko Agirre,et al.  SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation , 2016, *SEMEVAL.

[22]  Jaewoo Kang,et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining , 2019, Bioinform..

[23]  Timothy Baldwin,et al.  Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity , 2013 .

[24]  Qingyu Chen,et al.  BioWordVec, improving biomedical word embeddings with subword information and MeSH , 2019, Scientific Data.

[25]  Jingcheng Du,et al.  Evaluation of Five Sentence Similarity Models on Electronic Medical Records , 2019, BCB.

[26]  Stuart J. Nelson,et al.  Normalized names for clinical drugs: RxNorm at 6 years , 2011, J. Am. Medical Informatics Assoc..

[27]  Liliana Ferreira,et al.  1. Application of text mining to biomedical knowledge extraction: analyzing clinical narratives and medical literature , 2014 .

[28]  Jason Roy,et al.  Prediction Modeling Using EHR Data: Challenges, Strategies, and a Comparison of Machine Learning Approaches , 2010, Medical care.

[29]  Claire Cardie,et al.  SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability , 2015, *SEMEVAL.

[30]  Amy Neustein Text Mining of Web-based Medical Content , 2014 .

[31]  Tianlin Liu,et al.  Correcting the Common Discourse Bias in Linear Representation of Sentences using Conceptors , 2018, ArXiv.

[32]  Michael D. Reis,et al.  Types and origins of diagnostic errors in primary care settings. , 2013, JAMA internal medicine.