Practice of Epidemiology Using Natural Language Processing to Improve Efficiency of Manual Chart Abstraction in Research : The Case of Breast Cancer Recurrence

The increasing availability of electronic health records (EHRs) creates opportunities for automated extraction of information from clinical text. We hypothesized that natural language processing (NLP) could substantially reduce the burden of manual abstraction in studies examining outcomes, like cancer recurrence, that are documented in unstructured clinical text, such as progress notes, radiology reports, and pathology reports. We developed an NLPbased system using open-source software to process electronic clinical notes from 1995 to 2012 for women with early-stage incident breast cancers to identify whether and when recurrences were diagnosed. We developed and evaluated the system using clinical notes from 1,472 patients receiving EHR-documented care in an integrated health care system in the Pacific Northwest. A separate study provided the patient-level reference standard for recurrence status and date. The NLP-based system correctly identified 92% of recurrences and estimated diagnosis dates within 30 days for 88%of these. Specificity was 96%. TheNLP-based system overlooked 5 of 65 recurrences, 4 because electronic documents were unavailable. The NLP-based system identified 5 other recurrences incorrectly classified as nonrecurrent in the reference standard. If used in similar cohorts, NLP could reduce by 90% the number of EHR charts abstracted to identify confirmed breast cancer recurrence cases at a rate comparable to traditional abstraction.

[1]  W. DuMouchel,et al.  Unlocking Clinical Data from Narrative Reports: A Study of Natural Language Processing , 1995, Annals of Internal Medicine.

[2]  G Hripcsak,et al.  Natural language processing and its future in medicine. , 1999, Academic medicine : journal of the Association of American Medical Colleges.

[3]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .

[4]  Peter J. Haug,et al.  A Comparison of Classification Algorithms to Automatically Identify Chest X-Ray Reports That Support Pneumonia , 2001, J. Biomed. Informatics.

[5]  J. Austin,et al.  Use of natural language processing to translate clinical information from a database of 889,921 chest radiographic reports. , 2002, Radiology.

[6]  T. Lash,et al.  Breast cancer treatment of older women in integrated health care settings. , 2006, Journal of clinical oncology : official journal of the American Society of Clinical Oncology.

[7]  T. Lumley,et al.  Breast cancer recurrence risk in relation to antidepressant use after diagnosis , 2008, Breast Cancer Research and Treatment.

[8]  John F. Hurdle,et al.  Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research , 2008, Yearbook of Medical Informatics.

[9]  David A. Hanauer,et al.  Enhanced identification of eligibility for depression research using an electronic medical record search engine , 2009, Int. J. Medical Informatics.

[10]  Guergana K. Savova,et al.  Discerning Tumor Status from Unstructured MRI Reports—Completeness of Information in Existing Reports and Utility of Automated Natural Language Processing , 2009, Journal of Digital Imaging.

[11]  James W. Cooper,et al.  Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model , 2009, J. Biomed. Informatics.

[12]  B. Dean,et al.  Review: Use of Electronic Medical Records for Health Outcomes Research , 2009, Medical care research and review : MCRR.

[13]  D. Buist,et al.  Referral, receipt, and completion of chemotherapy in patients with early-stage breast cancer older than 65 years and at high risk of breast cancer recurrence. , 2009, Journal of clinical oncology : official journal of the American Society of Clinical Oncology.

[14]  Sunghwan Sohn,et al.  Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications , 2010, J. Am. Medical Informatics Assoc..

[15]  Michael Feldman,et al.  caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research , 2010, J. Am. Medical Informatics Assoc..

[16]  I. Kohane,et al.  Electronic medical records for discovery research in rheumatoid arthritis , 2010, Arthritis care & research.

[17]  Jin Fan,et al.  Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease , 2010, J. Am. Medical Informatics Assoc..

[18]  Christopher G Chute,et al.  Discovering peripheral arterial disease cases from radiology notes using natural language processing. , 2010, AMIA ... Annual Symposium proceedings. AMIA Symposium.

[19]  D. Roden,et al.  The Emerging Role of Electronic Medical Records in Pharmacogenomics , 2011, Clinical pharmacology and therapeutics.

[20]  M. Fava,et al.  Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model , 2011, Psychological Medicine.

[21]  C. Chute,et al.  Electronic Medical Records for Genetic Research: Results of the eMERGE Consortium , 2011, Science Translational Medicine.

[22]  Lucila Ohno-Machado,et al.  Natural language processing: an introduction , 2011, J. Am. Medical Informatics Assoc..

[23]  A. Jha,et al.  The promise of electronic records: around the corner or down the road? , 2011, JAMA.

[24]  Steven H. Brown,et al.  Automated identification of postoperative complications within an electronic medical record using natural language processing. , 2011, JAMA.

[25]  I. Kohane Using electronic health records to drive discovery in disease genomics , 2011, Nature Reviews Genetics.

[26]  Jessica Chubak,et al.  Tradeoffs between accuracy measures for electronic health care data algorithms. , 2012, Journal of clinical epidemiology.

[27]  Jessica Chubak,et al.  Administrative data algorithms to identify second breast cancer events following early-stage invasive breast cancer. , 2012, Journal of the National Cancer Institute.

[28]  Lin Chen,et al.  Importance of multi-modal approaches to effectively identify cataract cases from electronic health records , 2012, J. Am. Medical Informatics Assoc..

[29]  Bruce M Psaty,et al.  Use of administrative data to estimate the incidence of statin-related rhabdomyolysis. , 2012, JAMA.

[30]  Judith W. Dexheimer,et al.  Natural Language Processing – The Basics , 2012 .

[31]  Ming Li,et al.  Natural Language Processing Improves Identification of Colorectal Cancer Testing in the Electronic Medical Record , 2012, Medical decision making : an international journal of the Society for Medical Decision Making.

[32]  Christopher G. Chute,et al.  Automated discovery of drug treatment patterns for endocrine therapy of breast cancer within an electronic medical record , 2012, J. Am. Medical Informatics Assoc..

[33]  D. Buist,et al.  Frequent Antibiotic Use and Second Breast Cancer Events , 2013, Cancer Epidemiology, Biomarkers & Prevention.

[34]  Justin A. Strauss,et al.  Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm , 2012, J. Am. Medical Informatics Assoc..

[35]  Nikki M. Carroll,et al.  Validating Billing/Encounter Codes as Indicators of Lung, Colorectal, Breast, and Prostate Cancer Recurrence Using 2 Large Contemporary Cohorts , 2014, Medical care.

[36]  Judith W. Dexheimer,et al.  Natural Language Processing: Applications in Pediatric Research , 2016 .