Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature

Background: The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved.

Results: We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents.

Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium.

Conclusion: Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world.

Textpresso Central URL: http://www.textpresso.org/tpc
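To make the idea of combining keyword and category searches concrete, the following is a minimal Python sketch, not TPC's actual implementation (which relies on Lucene and UIMA). It assumes a tiny in-memory list of sentences and a hypothetical category lexicon; the names `CATEGORIES`, `tokenize`, and `search`, the category labels, and the `[[...]]` highlighting convention are all illustrative assumptions.

```python
import re

# Hypothetical category lexicon: each category is a set of semantically
# related terms. The names and terms are invented for illustration only.
CATEGORIES = {
    "kinase_activity": {"phosphorylation", "phosphorylates", "kinase"},
    "cell_component": {"centriole", "centrosome"},
}

TOKEN = re.compile(r"[A-Za-z0-9-]+")


def tokenize(sentence):
    """Lower-cased word tokens; hyphenated names such as ZYG-1 stay whole."""
    return TOKEN.findall(sentence.lower())


def search(sentences, keywords=(), categories=()):
    """Return (index, highlighted sentence) pairs for sentences that contain
    every keyword and at least one term from every requested category."""
    hits = []
    for i, sentence in enumerate(sentences):
        tokens = set(tokenize(sentence))
        matched = set()
        ok = True
        for kw in keywords:
            if kw.lower() in tokens:
                matched.add(kw.lower())
            else:
                ok = False
                break
        if ok:
            for cat in categories:
                found = tokens & CATEGORIES[cat]
                if not found:
                    ok = False
                    break
                matched |= found
        if ok:
            # Wrap each matched token in [[...]] so it can be shown in context.
            highlighted = TOKEN.sub(
                lambda m: f"[[{m.group(0)}]]"
                if m.group(0).lower() in matched
                else m.group(0),
                sentence,
            )
            hits.append((i, highlighted))
    return hits


# Made-up sample sentences standing in for a full text corpus.
SAMPLE = [
    "Phosphorylation of SAS-6 by ZYG-1 is critical for centriole formation.",
    "The zyg-1 gene encodes a regulator of centrosome duplication.",
    "Ontologies can be used to describe mouse phenotypes.",
]

results = search(SAMPLE, keywords=["zyg-1"], categories=["kinase_activity"])
```

In this sketch a sentence is returned only if it satisfies the keyword and the category constraint together, which mirrors the combined query style described above; a production system would replace the linear scan with an inverted index.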
