Natural language processing for populating lung cancer clinical research data

Lung cancer is the second most common cancer in both men and women. The wide adoption of electronic health records (EHRs) offers the potential to accelerate cohort-related epidemiological studies using informatics approaches. Because manually extracting information from large volumes of text is time-consuming and labor-intensive, efforts have emerged to extract information about lung cancer patients automatically using natural language processing (NLP), an artificial intelligence technique. In this study, we used an existing cohort of 2311 lung cancer patients whose stage, histology, tumor grade, and therapies (chemotherapy, radiotherapy, and surgery) had been manually ascertained. We developed and evaluated an NLP system that extracts these variables automatically for the same patients from clinical narratives, including clinical notes, pathology reports, and surgery reports. In the evaluation, recall for stage, histology, tumor grade, and therapies was 89, 98, 78, and 100%, respectively, and precision was 70, 88, 90, and 100%, respectively. This study demonstrated the feasibility and accuracy of automatically extracting pre-defined information from clinical narratives for lung cancer research.
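
To make the reported metrics concrete, the following is a minimal sketch of how precision and recall can be computed for one variable when NLP output is compared against manually ascertained values. The abstract does not state the exact scoring convention used in the study, so the definition here (a prediction counts as correct only if it exactly matches the manual value), the function name `precision_recall`, and the sample stage values are all illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    gold: Sequence[Optional[str]],
    predicted: Sequence[Optional[str]],
) -> Tuple[float, float]:
    """Score one variable (e.g. stage) across patients.

    gold[i] is the manually ascertained value for patient i and
    predicted[i] is the NLP-extracted value; None means "no value".
    Assumed convention: a prediction is a true positive only when it
    exactly matches the manual value for the same patient.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same patients")

    # Predictions that agree with the manual abstraction.
    true_pos = sum(
        1 for g, p in zip(gold, predicted) if p is not None and g == p
    )
    n_predicted = sum(1 for p in predicted if p is not None)
    n_gold = sum(1 for g in gold if g is not None)

    # Precision: share of extracted values that are correct.
    precision = true_pos / n_predicted if n_predicted else 0.0
    # Recall: share of manually recorded values that were recovered.
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical stage values for five patients (not data from the study).
gold_stage = ["IIIA", "IV", "IB", None, "IIB"]
nlp_stage = ["IIIA", "IV", "IIA", "IA", None]

precision, recall = precision_recall(gold_stage, nlp_stage)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this convention, a variable with high recall but lower precision (as reported for stage) would correspond to a system that recovers most manually recorded values but also emits a noticeable share of values that disagree with the manual abstraction.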
