Accelerating annotation of articles via automated approaches: evaluation of the neXtA5 curation-support tool by neXtProt

Abstract The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction. Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations. The evaluation of the precision of document triage showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while the two curators agree with each other on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory. For concept extraction, curators approved 35% (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36% (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need to improve the ranking function for neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59% (D) and 63% (BP). These results suggest that, when considering only the first extracted entity, the current system achieves a precision comparable with that of expert biocurators.
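The three measures reported above (precision and recall of suggested annotations, and precision at first rank) can be illustrated with a minimal Python sketch. It assumes annotations are compared as exact term matches per abstract; the function names and the toy identifiers are hypothetical and are not taken from neXtA5 or from the evaluation data.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(suggested: Set[str], curated: Set[str]) -> Tuple[float, float]:
    """Precision: share of tool suggestions the curator approved.
    Recall: share of curator-identified terms the tool also suggested."""
    hits = len(suggested & curated)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(curated) if curated else 0.0
    return precision, recall


def precision_at_first_rank(ranked: Dict[str, List[str]],
                            curated: Dict[str, Set[str]]) -> float:
    """Share of abstracts whose top-ranked suggestion was approved.
    Abstracts with no suggestion are ignored (an assumption of this sketch)."""
    scored = [doc for doc, terms in ranked.items() if terms]
    if not scored:
        return 0.0
    correct = sum(1 for doc in scored if ranked[doc][0] in curated.get(doc, set()))
    return correct / len(scored)


# Toy data only: four suggested terms, one of which the curator approved.
p, r = precision_recall({"t1", "t2", "t3", "t4"}, {"t1", "t5"})
top = precision_at_first_rank(
    {"abs1": ["t1", "t2"], "abs2": ["t9"]},
    {"abs1": {"t1"}, "abs2": {"t5"}},
)
```

Under these definitions, a system can have low overall precision yet a high precision at first rank if its ranking places approved terms first, which is the effect the ranking improvement targets.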
