SODA: A Natural Language Processing Package to Extract Social Determinants of Health for Cancer Studies

Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rate of SDoH using cancer populations. Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models to extract SDoH, examined the generalizability of the NLP models to a cohort of patients prescribed opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804), and colorectal cancer (n=6,240) cohorts. Results and Conclusion: We developed a corpus of 629 cancer patients' notes with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores of 0.9216 and 0.9441 for SDoH concept extraction, and 0.9617 and 0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models using new annotations from opioid use patients improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates among the 19 categories of SDoH varied greatly: 10 SDoH could be extracted from >70% of cancer patients, but 9 SDoH had a low extraction rate (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at https://github.com/uf-hobiinformatics-lab/SDoH_SODA.
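
The abstract reports strict and lenient F1 scores without defining them. The sketch below illustrates one common convention for concept extraction, and it is an assumption here rather than the evaluation code used by SODA: a strict match requires identical span boundaries and label, while a lenient match requires only overlapping spans with the same label. The function names, the span representation, and the example labels are hypothetical.

```python
from typing import List, Tuple

# A concept is (start_offset, end_offset, label); end is exclusive.
# This representation is an assumption made for illustration.
Span = Tuple[int, int, str]


def spans_match(gold: Span, pred: Span, lenient: bool) -> bool:
    """Return True if pred counts as a match for gold."""
    if gold[2] != pred[2]:
        return False  # labels must agree in both modes
    if lenient:
        # Lenient: any character overlap between the two spans.
        return gold[0] < pred[1] and pred[0] < gold[1]
    # Strict: boundaries must be identical.
    return gold[0] == pred[0] and gold[1] == pred[1]


def f1_score(gold: List[Span], pred: List[Span], lenient: bool = False) -> float:
    """Compute F1 from precision over predictions and recall over gold spans."""
    matched_pred = sum(any(spans_match(g, p, lenient) for g in gold) for p in pred)
    matched_gold = sum(any(spans_match(g, p, lenient) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: the second prediction overlaps the gold span
# but has a different start offset; the third has no gold counterpart.
gold = [(0, 7, "Tobacco_use"), (20, 28, "Employment")]
pred = [(0, 7, "Tobacco_use"), (22, 28, "Employment"), (40, 45, "Alcohol_use")]

strict_f1 = f1_score(gold, pred, lenient=False)
lenient_f1 = f1_score(gold, pred, lenient=True)
print(f"strict F1 = {strict_f1:.4f}, lenient F1 = {lenient_f1:.4f}")
```

Under this convention the lenient score is never lower than the strict score, which is consistent with the pairs of values reported above.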
