Information Extraction for Populating Lung Cancer Clinical Research Data

Lung cancer is the second most common cancer, and the wide adoption of electronic health records (EHRs) offers the potential to accelerate cohort-related epidemiological studies through informatics approaches. In this study, we developed and evaluated a natural language processing (NLP) system that automatically extracts information on stage, histology, grade, and therapies (chemotherapy, radiotherapy, and surgery) for lung cancer patients from clinical narratives, including clinical notes, pathology reports, and surgery reports. The evaluation showed promising results: recall for stage, histology, grade, and therapies was 89%, 98%, 80%, and 100%, respectively, and precision was 71%, 89%, 90%, and 100%, respectively. This study demonstrates that extracting this information from clinical narratives for lung cancer research is feasible and accurate.
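
For readers less familiar with the reported metrics, the following is a minimal sketch of how precision and recall can be computed by comparing system output with a manually annotated gold standard. It is not the study's evaluation code: the function name `precision_recall`, the representation of each extraction as a (document ID, value) pair, and the toy data are all assumptions made for illustration.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for two collections of (doc_id, value) pairs.

    precision = true positives / all extracted items
    recall    = true positives / all gold-standard items
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data for the "stage" field; not taken from the study.
gold_stage = {("doc1", "IIIA"), ("doc2", "IV"), ("doc3", "IB"), ("doc4", "IIB")}
system_stage = {("doc1", "IIIA"), ("doc2", "IV"), ("doc3", "IA"), ("doc5", "IV")}

precision, recall = precision_recall(system_stage, gold_stage)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.50 recall=0.50"
```

Under this convention, a wrong value for an annotated document (doc3 in the toy data) counts against both metrics, while a value extracted for a document with no gold annotation (doc5) lowers precision only.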