Use of a support vector machine for categorizing free-text notes: assessment of accuracy across two institutions

BACKGROUND Electronic health record (EHR) users must regularly review large amounts of data to make informed clinical decisions, and such review is time-consuming and often overwhelming. Technologies such as automated summarization tools, EHR search engines and natural language processing have been shown to help clinicians manage this information.

OBJECTIVE To develop a support vector machine (SVM)-based system for identifying EHR progress notes pertaining to diabetes, and to validate it at two institutions.

MATERIALS AND METHODS We retrieved 2000 EHR progress notes from patients with diabetes at the Brigham and Women's Hospital (1000 for training and 1000 for testing) and another 1000 notes from the University of Texas Physicians (for validation). We manually annotated all notes and trained an SVM using a bag-of-words approach. We then applied the SVM to the testing and validation sets and evaluated its performance with the area under the curve (AUC) and F statistics.

RESULTS The model accurately identified diabetes-related notes in both the Brigham and Women's Hospital testing set (AUC=0.956, F=0.934) and the external University of Texas Faculty Physicians validation set (AUC=0.947, F=0.935).

DISCUSSION Overall, the model was accurate, with AUC and F values above 0.93 on both sets. Furthermore, it generalized, without loss of accuracy, to another institution with a different EHR and a distinct patient and provider population.

CONCLUSIONS It is possible to use an SVM-based classifier to identify EHR progress notes pertaining to diabetes, and the model generalizes well.
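The pipeline described in the methods (bag-of-words features, an SVM, evaluation by AUC and F) can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the abstract does not state the tokenization, kernel, or software used, so the whitespace tokenizer, the linear SVM trained by subgradient descent, the reading of "F" as the F1 measure, and the toy notes below are all assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (assumed tokenizer)."""
    return Counter(text.lower().split())

def score(w, b, x):
    """Linear decision value w.x + b for a bag-of-words vector x."""
    return sum(w.get(tok, 0.0) * cnt for tok, cnt in x.items()) + b

def train_linear_svm(docs, labels, epochs=200, lr=0.1, lam=0.01):
    """Full-batch subgradient descent on the L2-regularized hinge loss.
    Labels are +1 (diabetes-related) or -1 (not related)."""
    w, b, n = {}, 0.0, len(docs)
    for _ in range(epochs):
        grad_w, grad_b = Counter(), 0.0
        for x, y in zip(docs, labels):
            if y * score(w, b, x) < 1:          # example violates the margin
                for tok, cnt in x.items():
                    grad_w[tok] += y * cnt / n
                grad_b += y / n
        for tok in w:                           # shrink weights (regularization)
            w[tok] *= 1 - lr * lam
        for tok, g in grad_w.items():           # step toward margin violators
            w[tok] = w.get(tok, 0.0) + lr * g
        b += lr * grad_b
    return w, b

def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def f1_score(predicted, labels):
    """Harmonic mean of precision and recall for the +1 class."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == -1)
    fn = sum(1 for p, y in zip(predicted, labels) if p == -1 and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented toy notes standing in for annotated progress notes.
train_notes = [
    ("insulin dose adjusted glucose elevated", 1),
    ("a1c checked metformin continued", 1),
    ("knee pain after fall", -1),
    ("cough and fever for three days", -1),
]
test_notes = [
    ("glucose high insulin started", 1),
    ("fever and knee swelling", -1),
]

w, b = train_linear_svm([bag_of_words(t) for t, _ in train_notes],
                        [y for _, y in train_notes])
test_labels = [y for _, y in test_notes]
test_scores = [score(w, b, bag_of_words(t)) for t, _ in test_notes]
test_pred = [1 if s > 0 else -1 for s in test_scores]
print("AUC:", auc(test_scores, test_labels))
print("F1:", f1_score(test_pred, test_labels))
```

AUC is computed from the raw decision values and so does not depend on a classification threshold, whereas F1 depends on the cutoff (here, zero). Reporting both, as the abstract does, therefore describes ranking quality and thresholded accuracy separately.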
