Mining 100 million notes to find homelessness and adverse childhood experiences: 2 case studies of rare and severe social determinants of health in electronic health records

Objective: Identifying social determinants of health in electronic health records (EHRs) could provide important insights into health and disease outcomes. We developed a methodology to capture 2 rare and severe social determinants of health, homelessness and adverse childhood experiences (ACEs), from a large EHR repository.

Materials and Methods: We first constructed lexicons to capture homelessness and ACE phenotypic profiles. We employed word2vec and lexical associations to mine homelessness-related words. Next, using relevance feedback, we refined the 2 profiles with iterative searches over 100 million notes from the Vanderbilt EHR. Seven assessors manually reviewed the top-ranked results of 2544 patient visits relevant for homelessness and 1000 patients relevant for ACE.

Results: For extracting homelessness-related words, word2vec performed better (area under the precision-recall curve [AUPRC] = 0.94) than lexical associations (AUPRC = 0.83). A comparison of searches for the 2 phenotypes showed higher performance for homelessness (AUPRC = 0.95) than for ACE (AUPRC = 0.79). A temporal analysis of the homeless population showed that the majority experienced chronic homelessness. Most ACE patients suffered sexual (70%) and/or physical (50.6%) abuse, with the top-ranked abuser keywords being "father" (21.8%) and "mother" (15.4%). The most prevalent associated conditions for homeless patients were lack of housing (62.8%) and tobacco use disorder (61.5%), while for ACE patients they were mental disorders (36.6%-47.6%).

Conclusion: We provide an efficient solution for mining homelessness and ACE information from EHRs, which can facilitate large clinical and genetic studies of these social determinants of health.
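
The abstract reports AUPRC over ranked results but does not state which estimator was used. As an illustration only, the sketch below computes average precision, one common step-wise estimate of AUPRC, from a ranked list of assessor judgments. The function name and the sample judgments are hypothetical and are not taken from the study.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average precision of a ranked list (one common estimate of AUPRC).

    relevance[i] is True when the item at rank i + 1 was judged relevant.
    Assumption: every relevant item appears somewhere in the ranked list,
    so recall reaches 1.0 at the last relevant item.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, sampled only where recall increases.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical judgments for the top 5 results of one search.
judgments = [True, True, False, True, False]
print(f"average precision = {average_precision(judgments):.3f}")
```

Because precision is sampled only at ranks where a relevant item appears, a ranking that places relevant items first scores close to 1, which matches the intuition behind comparing the two phenotype searches by AUPRC.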
