Combining pattern matching with word embeddings for the extraction of experimental variables from scientific literature

Scientists frequently use experiments published in other articles, or in reports by governing entities (e.g., NIH), as templates for reporting on their own experiments. Those templates occasionally change to reflect new discoveries. For retrospective studies and meta-analyses, finding the template parameters associated with scientific results can be critical. To aid in the extraction of experimental parameters (e.g., animal housing temperature) from a corpus of ∼8M scientific reports, we used a combination of pattern matching, part-of-speech tagging, units and measures extraction, and machine learning. We describe a use case in which the housing temperature used for experiments involving mice was shown to affect their response to tumor-reducing drugs. We show that 1) combining deep learning with pattern matching is an effective approach to the extraction problem described and 2) researchers' behavior and experimental template usage take a while to change after the publication of an important discovery.
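To make the pattern-matching and units-extraction steps more concrete, the following is a minimal sketch in Python. It is not the pipeline used in this work: the regular expression, the cue-word list, and the function name `extract_housing_temperatures` are all illustrative assumptions, and the part-of-speech tagging and word-embedding stages are not shown.

```python
import re

# Hypothetical pattern (an assumption for illustration): a number, or a
# numeric range such as "20-26" or "20 to 26", followed by a degree sign
# and a Celsius or Fahrenheit unit.
TEMP_PATTERN = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*(?:-|–|to)\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*°\s*(?P<unit>[CF])\b"
)

# Hypothetical cue words suggesting that a sentence describes animal housing.
HOUSING_CUES = ("housed", "housing", "maintained", "kept")


def extract_housing_temperatures(sentence):
    """Return (low, high, unit) tuples for temperatures in a housing sentence.

    Sentences without any housing cue word are skipped, so temperatures that
    describe other steps (e.g., centrifugation) are not reported.
    """
    lowered = sentence.lower()
    if not any(cue in lowered for cue in HOUSING_CUES):
        return []

    results = []
    for match in TEMP_PATTERN.finditer(sentence):
        low = float(match.group("low"))
        # A single value is reported as a degenerate range (low == high).
        high = float(match.group("high")) if match.group("high") else low
        results.append((low, high, match.group("unit")))
    return results
```

A cue-word filter like this one is brittle by design. In a combined approach, a learned model (for example, a classifier over word embeddings) could take over the job of deciding which sentences describe housing conditions, leaving the regular expression to extract the value and unit.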
