Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD)

Background: The Comparative Toxicogenomics Database (CTD) is a publicly available resource that promotes understanding about the etiology of environmental diseases. It provides manually curated chemical-gene/protein interactions and chemical- and gene-disease relationships from the peer-reviewed, published literature. The goals of the research reported here were to establish a baseline analysis of current CTD curation, develop a text-mining prototype from readily available open source components, and evaluate its potential value in augmenting curation efficiency and increasing data coverage.

Results: Prototype text-mining applications were developed and evaluated using a CTD data set consisting of manually curated molecular interactions and relationships from 1,600 documents. Preliminary results indicated that the prototype found 80% of the gene, chemical, and disease terms appearing in curated interactions. These terms were used to re-rank documents for curation, resulting in increases in mean average precision (63% for the baseline vs. 73% for a rule-based re-ranking), and in the correlation coefficient of rank vs. number of curatable interactions per document (baseline 0.14 vs. 0.38 for the rule-based re-ranking).

Conclusion: This text-mining project is unique in its integration of existing tools into a single workflow with direct application to CTD. We performed a baseline assessment of the inter-curator consistency and coverage in CTD, which allowed us to measure the potential of these integrated tools to improve prioritization of journal articles for manual curation. Our study presents a feasible and cost-effective approach for developing a text mining solution to enhance manual curation throughput and efficiency.
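For readers unfamiliar with the ranking metric quoted in the results, the following is a minimal sketch of the standard definition of mean average precision. It is not the study's evaluation code: the function names are hypothetical, each ranking is assumed to be a list of booleans marking whether the document at that rank was curatable, and the example rankings are toy values rather than CTD data.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average precision for one ranked list.

    relevance[i] is True when the document at rank i+1 is relevant
    (assumed here to mean "curatable"). Precision is sampled at each
    relevant document and averaged over the relevant documents found.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-ranking average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Toy rankings (hypothetical, not taken from the study).
baseline = [[False, True, True], [False, True]]
reranked = [[True, True, False], [True, False]]

print(mean_average_precision(baseline))
print(mean_average_precision(reranked))
```

A re-ranking that moves curatable documents toward the top of each list raises the score, which is the kind of improvement the reported baseline and rule-based figures describe.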
