Classification and analysis of a large collection of in vivo bioassay descriptions

Testing potential drug treatments in animal disease models is a decisive step in all preclinical drug discovery programs. Yet, despite the importance of such experiments for translational medicine, there have been relatively few efforts to comprehensively and consistently analyze the data produced by in vivo bioassays. This is partly due to their complexity and the lack of accepted reporting standards: publicly available animal screening data are accessible only as unstructured free text, which hinders computational analysis. In this study, we use text mining to extract information from the descriptions of over 100,000 drug screening-related assays in rats and mice. We retrieve our dataset from ChEMBL, an open-source literature-based database focused on preclinical drug discovery. We show that in vivo assay descriptions can be effectively mined for relevant information, including experimental factors that might influence the outcome and reproducibility of animal research: genetic strains, experimental treatments, and phenotypic readouts used in the experiments. We further systematize the extracted information using an unsupervised language model (Word2Vec), which learns semantic similarities between terms and phrases, allowing identification of related animal models and classification of entire assay descriptions. In addition, we show that random forest models trained on features generated by Word2Vec can predict the class of drugs tested in different in vivo assays with high accuracy. Finally, we combine information mined from text with curated annotations stored in ChEMBL to investigate the patterns of usage of different animal models across a range of experiments, drug classes, and disease areas.
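The idea of comparing entire assay descriptions through word vectors can be illustrated with a small example. This is a minimal sketch rather than the pipeline used in the study: it assumes that a description is represented by the average of its word vectors, and the three-dimensional embeddings, the token list, and the function names `description_vector` and `cosine` are all hypothetical. In practice the vectors would come from a trained Word2Vec model.

```python
import math

# Toy 3-dimensional embeddings. These values are made up for illustration
# and are NOT taken from the study or from any trained model.
TOY_VECTORS = {
    "wistar": [0.9, 0.1, 0.0],
    "rat": [0.8, 0.2, 0.1],
    "mouse": [0.7, 0.1, 0.3],
    "glucose": [0.1, 0.9, 0.2],
    "diabetic": [0.2, 0.8, 0.1],
    "seizure": [0.0, 0.2, 0.9],
}


def description_vector(text, vectors=TOY_VECTORS):
    """Average the vectors of the known tokens in a description.

    Assumption: simple whitespace tokenization and mean pooling.
    Returns None when no token has a vector.
    """
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return None
    dim = len(vectors[tokens[0]])
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical assay descriptions.
a = description_vector("glucose level in diabetic wistar rat")
b = description_vector("diabetic mouse glucose")
c = description_vector("seizure in mouse")

# With these toy vectors, the two diabetes-related descriptions
# are closer to each other than to the seizure description.
print(cosine(a, b) > cosine(a, c))
```

Averaged vectors of this kind could also serve as input features for a classifier such as a random forest, although the exact feature construction used by the authors is not specified here.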
