Ranking medical jargon in electronic health record notes by adapted distant supervision

Objective: Allowing patients to access their own electronic health record (EHR) notes through online patient portals has the potential to improve patient-centered care. However, medical jargon, which abounds in EHR notes, has been shown to be a barrier to patients' comprehension of their EHRs. Existing knowledge bases that link medical jargon to lay terms or definitions play an important role in alleviating this problem but have low coverage of the medical jargon found in EHRs. We developed a data-driven approach that mines EHRs to identify and rank medical jargon based on its importance to patients, to support the building of EHR-centric lay language resources.

Methods: We developed an innovative adapted distant supervision (ADS) model based on support vector machines to rank medical jargon from EHRs. For distant supervision, we utilized the open-access, collaborative consumer health vocabulary, a large, publicly available resource that links lay terms to medical jargon. We explored both knowledge-based features from the Unified Medical Language System and distributed word representations learned from large unlabeled corpora. We evaluated the ADS model using physician-identified important medical terms.

Results: Our ADS model significantly surpassed two state-of-the-art automatic term recognition methods, TF*IDF and C-Value, yielding 0.810 ROC-AUC versus 0.710 and 0.667, respectively. Our model identified 10K important medical jargon terms after ranking over 100K candidate terms mined from over 7,500 EHR narratives.

Conclusion: Our work is an important step towards enriching lexical resources that link medical jargon to lay terms/definitions to support patient EHR comprehension. The identified medical jargon terms and their rankings are available upon request.
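The general idea behind the Methods and Results can be illustrated with a small sketch. It is not the authors' ADS model: the adaptation step, the UMLS-derived features, and the word embeddings are not specified in this abstract, so the code below only shows plain distant supervision (a term is labeled positive when it appears in an existing vocabulary), a linear SVM trained by subgradient descent on the hinge loss, and ROC-AUC computed against a separate gold standard. All function names, the toy terms, the two-dimensional feature vectors, and the gold labels are assumptions made for illustration.

```python
"""Toy sketch: rank candidate terms with a linear SVM trained on distant labels.

Everything here (names, features, data) is hypothetical and for illustration only.
"""


def distant_labels(terms, vocabulary):
    """Label a term +1 if it appears in the vocabulary, else -1 (noisy labels)."""
    vocab = {t.lower() for t in vocabulary}
    return [1 if t.lower() in vocab else -1 for t in terms]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision_scores(X, w, b):
    """Signed distance-like score; larger means 'more likely important'."""
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical candidate terms and made-up 2-D feature vectors.
    terms = ["myocardial infarction", "hypertension", "metastasis",
             "patient", "today", "follow up"]
    features = [[1.0, 0.9], [1.0, 0.7], [1.0, 0.8],
                [0.0, 0.2], [0.0, 0.1], [0.0, 0.3]]
    vocabulary = ["myocardial infarction", "hypertension"]  # stand-in for a lay vocabulary

    y_distant = distant_labels(terms, vocabulary)  # noisy training labels
    w, b = train_linear_svm(features, y_distant)
    scores = decision_scores(features, w, b)

    for score, term in sorted(zip(scores, terms), reverse=True):
        print(f"{score:+.3f}  {term}")

    gold = [1, 1, 1, -1, -1, -1]  # hypothetical expert judgments, used only for evaluation
    print("ROC-AUC vs. gold:", roc_auc(scores, gold))
```

The point of separating `y_distant` from `gold` is that distant labels are cheap but incomplete (a term missing from the vocabulary may still matter to patients), so the ranking is judged against independent expert labels rather than against the labels it was trained on.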
