Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review

Background: Novel approaches that complement and go beyond evidence-based medicine are required in the domain of chronic diseases, given the growing incidence of such conditions in the worldwide population. A promising avenue is the secondary use of electronic health records (EHRs), where patient data are analyzed to conduct clinical and translational research. Machine learning methods for processing EHRs are improving the understanding of patient clinical trajectories and chronic disease risk prediction, creating a unique opportunity to derive previously unknown clinical insights. However, a wealth of clinical history remains locked in free-text clinical narratives. Consequently, unlocking the full potential of EHR data is contingent on the development of natural language processing (NLP) methods that automatically transform clinical text into structured clinical data that can guide clinical decisions and potentially delay or prevent disease onset.

Objective: The goal of the research was to provide a comprehensive overview of the development and uptake of NLP methods applied to free-text clinical notes related to chronic diseases, including the investigation of challenges faced by NLP methodologies in understanding clinical narratives.

Methods: Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed, and searches were conducted in 5 databases using “clinical notes,” “natural language processing,” and “chronic disease” and their variations as keywords to maximize coverage of the articles.

Results: Of the 2652 articles considered, 106 met the inclusion criteria. Review of the included papers resulted in identification of 43 chronic diseases, which were then further classified into 10 disease categories using the International Classification of Diseases, 10th Revision. The majority of studies focused on diseases of the circulatory system (n=38), while studies of endocrine and metabolic diseases were fewest (n=14). This was due to the structure of clinical records related to metabolic diseases, which typically contain much more structured data, compared with medical records for diseases of the circulatory system, which rely more on unstructured data and consequently have attracted a stronger NLP focus. The review has shown a significant increase in the use of machine learning methods compared to rule-based approaches; however, deep learning methods remain emergent (n=3). Consequently, the majority of works focus on classification of disease phenotype, with only a handful of papers addressing extraction of comorbidities from the free text or integration of clinical notes with structured data. There is a notable use of relatively simple methods, such as shallow classifiers (alone or in combination with rule-based methods), because their predictions are interpretable; interpretability still represents a significant issue for more complex methods. Finally, scarcity of publicly available data may also have contributed to insufficient development of more advanced methods, such as extraction of word embeddings from clinical notes.

Conclusions: Efforts are still required to improve (1) progression of clinical NLP methods from extraction toward understanding; (2) recognition of relations among entities rather than entities in isolation; (3) temporal extraction to understand past, current, and future clinical events; (4) exploitation of alternative sources of clinical knowledge; and (5) availability of large-scale, de-identified clinical corpora.
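To make concrete the kind of simple, interpretable rule-based method the Results describe, the following is a minimal sketch of phenotype detection by keyword matching with basic negation handling. It is an illustration only and is not taken from any of the reviewed studies: the trigger terms, the negation cues, the 40-character window, and the function name `mentions_phenotype` are all hypothetical choices.

```python
import re

# Hypothetical trigger terms for one phenotype (heart failure); illustrative only.
CONCEPT = re.compile(r"\b(heart failure|chf)\b", re.IGNORECASE)
# Hypothetical negation cues, looked for in a short window before each mention.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)


def mentions_phenotype(note: str, window: int = 40) -> bool:
    """Return True if the note contains at least one non-negated concept mention."""
    # Crude sentence split so a negation cue does not leak across sentences.
    for sentence in re.split(r"[.;\n]", note):
        for match in CONCEPT.finditer(sentence):
            start = max(0, match.start() - window)
            preceding = sentence[start:match.start()]
            if not NEGATION.search(preceding):
                return True
    return False
```

Every decision made by such a rule can be traced to a specific matched term, which is the interpretability advantage noted above. The cost is coverage: the sketch ignores abbreviation variants, negation cues that follow the mention, family history, and temporality, which relates to the open challenges listed in the Conclusions.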
