Open Agile text mining for bioinformatics: the PubAnnotation ecosystem

Abstract

Motivation: Most currently available text mining tools share two characteristics that make them less than optimal for use by biomedical researchers: they require extensive specialist skills in natural language processing and they were built on the assumption that they should optimize global performance metrics on representative datasets. This is a problem because most end-users are not natural language processing specialists and because biomedical researchers often care less about global metrics like F-measure or representative datasets than they do about more granular metrics such as precision and recall on their own specialized datasets. Thus, there are fundamental mismatches between the assumptions of much text mining work and the preferences of potential end-users.

Results: This article introduces the concept of Agile text mining, and presents the PubAnnotation ecosystem as an example implementation. The system approaches the problems from two perspectives: it allows the reformulation of text mining by biomedical researchers from the task of assembling a complete system to the task of retrieving warehoused annotations, and it makes it possible to do very targeted customization of the pre-existing system to address specific end-user requirements. Two use cases are presented: assisted curation of the GlycoEpitope database, and assessing coverage in the literature of pre-eclampsia-associated genes.

Availability and implementation: The three tools that make up the ecosystem, PubAnnotation, PubDictionaries and TextAE, are publicly available as web services, and also as open source projects. The dictionaries and the annotation datasets associated with the use cases are all publicly available through PubDictionaries and PubAnnotation, respectively.
