Natural language processing to identify lupus nephritis phenotype in electronic health records

Systemic lupus erythematosus (SLE) is a rare autoimmune disorder characterized by an unpredictable course of flares and remission with diverse manifestations. Lupus nephritis, a major manifestation of SLE associated with organ damage and mortality, is a key component of lupus classification criteria. Accurately identifying lupus nephritis in electronic health records (EHRs) would therefore benefit large cohort observational studies and clinical trials, where characterization of the patient population is critical for recruitment, study design, and analysis. Lupus nephritis can be recognized through procedure codes and structured data, such as laboratory tests. However, other critical information documenting lupus nephritis, such as histologic reports from kidney biopsies and prior medical history narratives, requires sophisticated text processing to mine from pathology reports and clinical notes. In this study, we developed algorithms to identify lupus nephritis with and without natural language processing (NLP) using EHR data from the Northwestern Medicine Enterprise Data Warehouse (NMEDW). We developed four algorithms: a rule-based algorithm using only structured data (the baseline algorithm) and three algorithms using different NLP models. The three NLP models are based on regularized logistic regression and use different feature sets: positive mentions of concept unique identifiers (CUIs), the number of appearances of CUIs, and a mixture of three components (i.e., a curated list of CUIs, regular expression concepts, and structured data), respectively. The baseline algorithm and the best-performing NLP algorithm were externally validated on a dataset from Vanderbilt University Medical Center (VUMC).
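The difference between the first two CUI feature sets can be illustrated with a minimal sketch. It is not the study's pipeline: the input format (per-patient lists of CUI and negation-flag pairs), the placeholder CUI strings, the function names, and the choice to count only non-negated mentions are all assumptions made for illustration. The resulting vectors are the kind of input a regularized logistic regression would take.

```python
from collections import Counter

# Hypothetical input: for each patient, (CUI, is_negated) pairs such as a
# concept-mapping NLP step might emit. The CUI strings are placeholders,
# not real UMLS identifiers.
annotations = {
    "patient_a": [
        ("CUI_NEPHRITIS", False),
        ("CUI_NEPHRITIS", False),
        ("CUI_PROTEINURIA", True),   # negated mention, ignored below
    ],
    "patient_b": [("CUI_PROTEINURIA", False)],
}


def build_vocabulary(annotations):
    """Return the sorted CUIs that have at least one non-negated mention."""
    return sorted(
        {
            cui
            for mentions in annotations.values()
            for cui, negated in mentions
            if not negated
        }
    )


def binary_features(mentions, vocabulary):
    """1 if the CUI is positively mentioned at least once, else 0."""
    positive = {cui for cui, negated in mentions if not negated}
    return [1 if cui in positive else 0 for cui in vocabulary]


def count_features(mentions, vocabulary):
    """Number of non-negated mentions of each CUI (an assumed definition)."""
    counts = Counter(cui for cui, negated in mentions if not negated)
    return [counts[cui] for cui in vocabulary]


vocabulary = build_vocabulary(annotations)
binary_a = binary_features(annotations["patient_a"], vocabulary)
counts_a = count_features(annotations["patient_a"], vocabulary)
```

For `patient_a` the two representations differ only in the repeated concept: the binary vector records that it was mentioned, while the count vector records how often.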
Our best-performing NLP model, which combines features from structured data, regular expression concepts, and mapped CUIs, improved the F-measure over the baseline lupus nephritis algorithm in both the NMEDW dataset (from 0.41 to 0.79) and the VUMC dataset (from 0.62 to 0.96).
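For readers unfamiliar with the metric, the F-measure is the weighted harmonic mean of precision and recall. The sketch below assumes the balanced form (F1, `beta = 1`), which the text does not state explicitly; the function name and the toy labels are illustrative and are not study data.

```python
def f_measure(y_true, y_pred, beta=1.0):
    """F-measure for binary labels (1 = lupus nephritis, 0 = not).

    beta = 1.0 gives the balanced F1 score; this default is an assumption.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = f_measure(y_true, y_pred)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a classifier cannot reach a high F-measure by maximizing only precision or only recall.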
