Novel text analytics approach to identify relevant literature for human health risk assessments: A pilot study with health effects of in utero exposures.

BACKGROUND Systematic reviews involve mining literature databases to identify relevant studies. Identifying potentially relevant studies can be informed by computational tools that compare text similarity between candidate studies and selected key (i.e., seed) references.

CHALLENGE Using computational approaches to identify relevant studies for risk assessments is challenging, because these assessments examine multiple effects of a chemical across lifestages (e.g., human health risk assessments) or specific effects of multiple chemicals (e.g., cumulative risk). The broad scope of potentially relevant literature can make selection of seed references difficult.

APPROACH We developed a generalized computational scoping strategy to identify human health-relevant studies for multiple chemicals and multiple effects. We used semi-supervised machine learning to prioritize studies for manual review, with training data derived from references cited in the hazard identification sections of several US EPA Integrated Risk Information System (IRIS) assessments. These generic training data, or seed studies, were clustered together with the unclassified corpus to group studies based on text similarity. Clusters containing a high proportion of seed studies were prioritized for manual review. Chemical names were removed from seed studies prior to clustering, resulting in a generic, chemical-independent method for identifying potentially human health-relevant studies. To test the recall of this novel literature search strategy, we developed a case study focused on identifying the array of chemicals that have been studied with respect to in utero exposure. We then evaluated the general strategy of using generic, chemical-independent training data with two previous IRIS assessments by comparing the studies predicted to be relevant with those used in the assessments (i.e., total relevant).

OUTCOME A keyword search designed to retrieve studies that examined the in utero effects of environmental chemicals identified over 54,000 candidate references. Clustering algorithms were applied using 1456 studies from multiple IRIS assessments, with chemical names removed, as training data or seeds (i.e., semi-supervised learning). Using a six-algorithm ensemble approach, 2602 articles, or approximately 5% of candidate references, were "voted" relevant by four or more clustering algorithms, and manual review confirmed that nearly 50% of these studies were relevant. Further evaluations on two IRIS assessments, using a nine-algorithm ensemble approach and a set of generic, chemical-independent, externally derived seed studies, correctly identified 77-83% of hazard identification studies published in the assessments and eliminated the need to manually screen more than 75% of search results on average.

LIMITATIONS The chemical-independent approach used to build the training literature set provides a broad and unbiased picture across a variety of endpoints and environmental exposures but does not systematically identify all available data. Variance between actual and predicted relevant studies will be greater because of the external and non-random origin of seed study selection. This approach depends on access to readily available generic training data that can be used to locate relevant references in an unclassified corpus.

IMPACT A generic approach to identifying human health-relevant studies could be an important first step in literature evaluation for risk assessments. This initial scoping approach could facilitate faster literature evaluation by focusing reviewer efforts, and could potentially minimize reviewer bias in the selection of key studies. Using externally derived training data is particularly applicable to databases with very low search precision, where identifying training data may be cost-prohibitive.
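The seed-based prioritization and ensemble voting described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy document ids, and the 0.5 seed-fraction threshold are all hypothetical (the abstract says only that clusters with "a high proportion" of seeds were prioritized), and the cluster assignments are assumed to come from whatever clustering algorithms are run on the combined seed and candidate text.

```python
import re
from collections import Counter


def strip_chemical_names(text, chemical_names):
    """Remove listed chemical names so seed text is chemical-independent.

    `chemical_names` is a hypothetical, non-empty list supplied by the user.
    """
    alternatives = "|".join(re.escape(name) for name in chemical_names)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    # Replace each name with a space, then collapse repeated whitespace.
    return re.sub(r"\s+", " ", pattern.sub(" ", text)).strip()


def relevant_by_clustering(labels, seed_ids, min_seed_fraction):
    """Return non-seed documents that share a cluster rich in seed studies.

    `labels` maps every document id (seeds and candidates) to a cluster id.
    The threshold value is an assumption, not a figure from the abstract.
    """
    cluster_sizes = Counter(labels.values())
    seed_counts = Counter(labels[doc] for doc in seed_ids)
    prioritized = {
        cluster
        for cluster, size in cluster_sizes.items()
        if seed_counts[cluster] / size >= min_seed_fraction
    }
    return {
        doc
        for doc, cluster in labels.items()
        if cluster in prioritized and doc not in seed_ids
    }


def ensemble_vote(predictions, min_votes):
    """Keep documents predicted relevant by at least `min_votes` algorithms."""
    votes = Counter(doc for predicted in predictions for doc in predicted)
    return {doc for doc, count in votes.items() if count >= min_votes}


# Toy example: two seeds (s1, s2), three candidates (d1-d3), and three
# hypothetical clustering runs over the same five documents.
seeds = {"s1", "s2"}
run_a = {"s1": 0, "s2": 0, "d1": 0, "d2": 1, "d3": 1}
run_b = {"s1": 0, "s2": 1, "d1": 0, "d2": 1, "d3": 2}
run_c = {"s1": 0, "s2": 0, "d1": 1, "d2": 0, "d3": 1}

predictions = [
    relevant_by_clustering(run, seeds, min_seed_fraction=0.5)
    for run in (run_a, run_b, run_c)
]
print(sorted(ensemble_vote(predictions, min_votes=2)))  # prints ['d1', 'd2']
```

In the case study reported above, the analogous setting would be six sets of predictions with `min_votes=4`; raising the vote threshold trades recall for a smaller manual screening workload.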

[1]  D. Levy,et al.  A cross-sectional study of well water arsenic and child IQ in Maine schoolchildren , 2014, Environmental Health.

[2]  F. Perera,et al.  Prenatal Airborne Polycyclic Aromatic Hydrocarbon Exposure and Child IQ at Age 5 Years , 2009, Pediatrics.

[3]  Stan Matwin,et al.  Exploiting the systematic review protocol for classification of medical abstracts , 2011, Artif. Intell. Medicine.

[4]  Tao Hong,et al.  Supervised clustering for automated document classification and prioritization: a case study using toxicological abstracts , 2018, Environment Systems and Decisions.

[5]  J. Bonde,et al.  In utero exposure to persistent organochlorine pollutants and reproductive health in the human male , 2013, Reproduction.

[6]  F. Perera,et al.  Prenatal Polycyclic Aromatic Hydrocarbon (PAH) Exposure and Child Behavior at Age 6–7 Years , 2012, Environmental health perspectives.

[7]  Siddhartha Jonnalagadda,et al.  A new iterative method to reduce workload in systematic review process , 2013, Int. J. Comput. Biol. Drug Des..

[8]  Yindalon Aphinyanagphongs,et al.  Research Paper: Text Categorization Models for High-Quality Article Retrieval in Internal Medicine , 2004, J. Am. Medical Informatics Assoc..

[9]  David Ogilvie,et al.  Pinpointing needles in giant haystacks: use of text mining to reduce impractical screening workload in extremely large scoping reviews , 2014, Research synthesis methods.

[10]  William R. Hersh,et al.  Reducing workload in systematic review preparation using automated citation classification. , 2006, Journal of the American Medical Informatics Association : JAMIA.

[11]  T. Tsuda,et al.  Intrauterine Exposure to Methylmercury and Neurocognitive Functions: Minamata Disease , 2015, Archives of environmental & occupational health.

[12]  M. Kennedy,et al.  Fish consumption during child bearing age: a quantitative risk-benefit analysis on neurodevelopment. , 2013, Food and chemical toxicology : an international journal published for the British Industrial Biological Research Association.

[13]  June-Soo Park,et al.  Hydroxylated polybrominated diphenyl ethers in paired maternal and cord sera. , 2013, Environmental science & technology.

[14]  N. Holland,et al.  Pesticide toxicity and the developing brain. , 2008, Basic & clinical pharmacology & toxicology.

[15]  Carla E. Brodley,et al.  Semi-automated screening of biomedical citations for systematic reviews , 2010, BMC Bioinformatics.

[16]  Patrice Sutton,et al.  An evidence-based medicine methodology to bridge the gap between clinical and environmental health sciences. , 2011, Health affairs.

[17]  S. Jeng,et al.  Perfluorinated Compound Levels in Cord Blood and Neurodevelopment at 2 Years of Age , 2013, Epidemiology.

[18]  Gábor L. Lövei,et al.  Application of Systematic Review Methodology to Food and Feed Safety Assessments to Support Decision Making , 2010 .

[19]  Christopher W. Belter,et al.  Citation analysis as a literature search method for systematic reviews , 2016, J. Assoc. Inf. Sci. Technol..

[20]  S. Ananiadou,et al.  Using text mining for study identification in systematic reviews: a systematic review of current approaches , 2015, Systematic Reviews.

[21]  M. Karagas,et al.  In utero arsenic exposure and fetal immune repertoire in a US pregnancy cohort. , 2014, Clinical immunology.

[22]  Division on Earth Progress Toward Transforming the Integrated Risk Information System (IRIS) Program: A 2018 Evaluation , 2018 .

[23]  Dina Demner-Fushman,et al.  Screening nonrandomized studies for medical systematic reviews: A comparative study of classifiers , 2012, Artif. Intell. Medicine.

[24]  S. Golder,et al.  Developing efficient search strategies to identify reports of adverse effects in MEDLINE and EMBASE. , 2006, Health information and libraries journal.

[25]  Linda S. Birnbaum,et al.  Implementing Systematic Review at the National Toxicology Program: Status and Next Steps , 2013, Environmental health perspectives.

[26]  John R. Bucher,et al.  Systematic Review and Evidence Integration for Literature-Based Environmental Health Science Assessments , 2014, Environmental health perspectives.

[27]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[28]  A. J. Gandolfi,et al.  In utero and early childhood exposure to arsenic decreases lung function in children , 2015, Journal of applied toxicology : JAT.

[29]  Louise Ryan,et al.  Combining data from multiple sources, with applications to environmental risk assessment , 2008, Statistics in medicine.

[30]  Division on Earth Review of Epa's Integrated Risk Information System (Iris) Process , 2014 .

[31]  Asa Bradman,et al.  In Utero and Childhood Polybrominated Diphenyl Ether (PBDE) Exposures and Neurodevelopment in the CHAMACOS Study , 2012, Environmental health perspectives.

[32]  Paul G Shekelle,et al.  Machine Learning Versus Standard Techniques for Updating Searches for Systematic Reviews: A Diagnostic Accuracy Study , 2017, Annals of Internal Medicine.

[33]  S. Murphy,et al.  Investigating Epigenetic Effects of Prenatal Exposure to Toxic Metals in Newborns: Challenges and Benefits , 2014, Medical Epigenetics.

[34]  Laura A. Levit,et al.  Finding what works in health care : standards for systematic reviews , 2011 .

[35]  L. Moore,et al.  Increased Lung and Bladder Cancer Incidence in Adults after In Utero and Early-Life Arsenic Exposure , 2014, Cancer Epidemiology, Biomarkers & Prevention.