MedSTS: a resource for clinical semantic textual similarity

The adoption of electronic health records (EHRs) has enabled a wide range of applications that leverage EHR data. However, the meaningful use of EHR data depends largely on our ability to efficiently extract and consolidate the information embedded in clinical text, for which natural language processing (NLP) techniques are essential. Semantic textual similarity (STS), which measures the degree of semantic similarity between text snippets, plays a significant role in many NLP applications. In the general NLP domain, STS shared tasks have made available a large collection of manually annotated text snippet pairs from various domains. In the clinical domain, STS can help detect and eliminate redundant information, which may reduce cognitive burden and improve clinical decision-making. This paper describes our efforts to assemble MedSTS, a resource for STS in the medical domain. It consists of 174,629 sentence pairs gathered from a clinical corpus at Mayo Clinic. A subset, MedSTS_ann, containing 1,068 sentence pairs was annotated by two medical experts with semantic similarity scores from 0 to 5 (low to high similarity). We further analyzed the medical concepts in the MedSTS corpus and tested four STS systems on the MedSTS_ann corpus. In the future, we will organize a shared task by releasing the MedSTS_ann corpus to motivate the community to tackle real-world clinical problems.
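
To make the 0–5 similarity scale concrete, the following minimal sketch scores a sentence pair by token overlap (Jaccard similarity) rescaled to that range. It is an illustrative baseline only and is not claimed to be one of the four STS systems tested on MedSTS_ann; the function names and the example sentences are hypothetical and were not taken from the corpus.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_sts(sentence_a: str, sentence_b: str) -> float:
    """Return a token-overlap similarity rescaled to the 0-5 range.

    Assumption: a simple lexical baseline, not a method from the paper.
    """
    tokens_a, tokens_b = tokenize(sentence_a), tokenize(sentence_b)
    if not tokens_a and not tokens_b:
        return 5.0  # two empty snippets are treated as identical
    # Jaccard index: shared tokens divided by all distinct tokens.
    overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return 5.0 * overlap


if __name__ == "__main__":
    # Hypothetical sentence pair, invented for illustration.
    pair = ("Patient denies chest pain.", "The patient reports no chest pain.")
    print(round(jaccard_sts(*pair), 2))
```

A purely lexical score of this kind ignores paraphrase and negation, which is one reason expert-annotated pairs are useful for evaluating stronger systems.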
