Using natural language processing to improve efficiency of manual chart abstraction in research: the case of breast cancer recurrence.

The increasing availability of electronic health records (EHRs) creates opportunities for automated extraction of information from clinical text. We hypothesized that natural language processing (NLP) could substantially reduce the burden of manual abstraction in studies examining outcomes, like cancer recurrence, that are documented in unstructured clinical text, such as progress notes, radiology reports, and pathology reports. We developed an NLP-based system using open-source software to process electronic clinical notes from 1995 to 2012 for women with early-stage incident breast cancers to identify whether and when recurrences were diagnosed. We developed and evaluated the system using clinical notes from 1,472 patients receiving EHR-documented care in an integrated health care system in the Pacific Northwest. A separate study provided the patient-level reference standard for recurrence status and date. The NLP-based system correctly identified 92% of recurrences and estimated diagnosis dates within 30 days for 88% of these. Specificity was 96%. The NLP-based system overlooked 5 of 65 recurrences, 4 because electronic documents were unavailable. The NLP-based system identified 5 other recurrences incorrectly classified as nonrecurrent in the reference standard. If used in similar cohorts, NLP could reduce by 90% the number of EHR charts abstracted to identify confirmed breast cancer recurrence cases at a rate comparable to traditional abstraction.

[1]  Nikki M. Carroll,et al.  Validating Billing/Encounter Codes as Indicators of Lung, Colorectal, Breast, and Prostate Cancer Recurrence Using 2 Large Contemporary Cohorts , 2014, Medical care.

[2]  D. Buist,et al.  Frequent Antibiotic Use and Second Breast Cancer Events , 2013, Cancer Epidemiology, Biomarkers & Prevention.

[3]  Justin A. Strauss,et al.  Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm , 2012, J. Am. Medical Informatics Assoc..

[4]  Jessica Chubak,et al.  Administrative data algorithms to identify second breast cancer events following early-stage invasive breast cancer. , 2012, Journal of the National Cancer Institute.

[5]  Christopher G. Chute,et al.  Automated discovery of drug treatment patterns for endocrine therapy of breast cancer within an electronic medical record , 2012, J. Am. Medical Informatics Assoc..

[6]  Bruce M Psaty,et al.  Use of administrative data to estimate the incidence of statin-related rhabdomyolysis. , 2012, JAMA.

[7]  Lin Chen,et al.  Importance of multi-modal approaches to effectively identify cataract cases from electronic health records , 2012, J. Am. Medical Informatics Assoc..

[8]  Jessica Chubak,et al.  Tradeoffs between accuracy measures for electronic health care data algorithms. , 2012, Journal of clinical epidemiology.

[9]  Ming Li,et al.  Natural Language Processing Improves Identification of Colorectal Cancer Testing in the Electronic Medical Record , 2012, Medical decision making : an international journal of the Society for Medical Decision Making.

[10]  Lucila Ohno-Machado,et al.  Natural language processing: an introduction , 2011, J. Am. Medical Informatics Assoc..

[11]  Steven H. Brown,et al.  Automated identification of postoperative complications within an electronic medical record using natural language processing. , 2011, JAMA.

[12]  A. Jha,et al.  The promise of electronic records: around the corner or down the road? , 2011, JAMA.

[13]  M. Fava,et al.  Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model , 2011, Psychological Medicine.

[14]  I. Kohane Using electronic health records to drive discovery in disease genomics , 2011, Nature Reviews Genetics.

[15]  C. Chute,et al.  Electronic Medical Records for Genetic Research: Results of the eMERGE Consortium , 2011, Science Translational Medicine.

[16]  D. Roden,et al.  The Emerging Role of Electronic Medical Records in Pharmacogenomics , 2011, Clinical pharmacology and therapeutics.

[17]  Christopher G Chute,et al.  Discovering peripheral arterial disease cases from radiology notes using natural language processing. , 2010, AMIA ... Annual Symposium proceedings. AMIA Symposium.

[18]  Jin Fan,et al.  Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease , 2010, J. Am. Medical Informatics Assoc..

[19]  Sunghwan Sohn,et al.  Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications , 2010, J. Am. Medical Informatics Assoc..

[20]  Michael Feldman,et al.  caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research , 2010, J. Am. Medical Informatics Assoc..

[21]  I. Kohane,et al.  Electronic medical records for discovery research in rheumatoid arthritis , 2010, Arthritis care & research.

[22]  David A. Hanauer,et al.  Enhanced identification of eligibility for depression research using an electronic medical record search engine , 2009, Int. J. Medical Informatics.

[23]  James W. Cooper,et al.  Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model , 2009, J. Biomed. Informatics.

[24]  D. Buist,et al.  Referral, receipt, and completion of chemotherapy in patients with early-stage breast cancer older than 65 years and at high risk of breast cancer recurrence. , 2009, Journal of clinical oncology : official journal of the American Society of Clinical Oncology.

[25]  Guergana K. Savova,et al.  Discerning Tumor Status from Unstructured MRI Reports—Completeness of Information in Existing Reports and Utility of Automated Natural Language Processing , 2009, Journal of Digital Imaging.

[26]  B. Dean,et al.  Review: Use of Electronic Medical Records for Health Outcomes Research , 2009, Medical care research and review : MCRR.

[27]  T. Lumley,et al.  Breast cancer recurrence risk in relation to antidepressant use after diagnosis , 2008, Breast Cancer Research and Treatment.

[28]  John F. Hurdle,et al.  Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research , 2008, Yearbook of Medical Informatics.

[29]  T. Lash,et al.  Breast cancer treatment of older women in integrated health care settings. , 2006, Journal of clinical oncology : official journal of the American Society of Clinical Oncology.

[30]  J. Austin,et al.  Use of natural language processing to translate clinical information from a database of 889,921 chest radiographic reports. , 2002, Radiology.

[31]  Peter J. Haug,et al.  A Comparison of Classification Algorithms to Automatically Identify Chest X-Ray Reports That Support Pneumonia , 2001, J. Biomed. Informatics.

[32]  G Hripcsak,et al.  Natural language processing and its future in medicine. , 1999, Academic medicine : journal of the Association of American Medical Colleges.

[33]  W. DuMouchel,et al.  Unlocking Clinical Data from Narrative Reports: A Study of Natural Language Processing , 1995, Annals of Internal Medicine.

[34]  Judith W. Dexheimer,et al.  Natural Language Processing: Applications in Pediatric Research , 2016 .

[35]  Judith W. Dexheimer,et al.  Natural Language Processing – The Basics , 2012 .

[36]  Ruprecht Raatz,et al.  Microsoft SQL Server 2008 , 2008 .

[37]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .