Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD)

Background: The Comparative Toxicogenomics Database (CTD) is a publicly available resource that promotes understanding about the etiology of environmental diseases. It provides manually curated chemical-gene/protein interactions and chemical- and gene-disease relationships from the peer-reviewed, published literature. The goals of the research reported here were to establish a baseline analysis of current CTD curation, develop a text-mining prototype from readily available open source components, and evaluate its potential value in augmenting curation efficiency and increasing data coverage.

Results: Prototype text-mining applications were developed and evaluated using a CTD data set consisting of manually curated molecular interactions and relationships from 1,600 documents. Preliminary results indicated that the prototype found 80% of the gene, chemical, and disease terms appearing in curated interactions. These terms were used to re-rank documents for curation, resulting in increases in mean average precision (63% for the baseline vs. 73% for a rule-based re-ranking), and in the correlation coefficient of rank vs. number of curatable interactions per document (baseline 0.14 vs. 0.38 for the rule-based re-ranking).

Conclusion: This text-mining project is unique in its integration of existing tools into a single workflow with direct application to CTD. We performed a baseline assessment of the inter-curator consistency and coverage in CTD, which allowed us to measure the potential of these integrated tools to improve prioritization of journal articles for manual curation. Our study presents a feasible and cost-effective approach for developing a text mining solution to enhance manual curation throughput and efficiency.
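The abstract reports mean average precision (MAP) for document rankings but does not spell out how the metric is computed. The sketch below shows the standard definition on hypothetical data. It is an illustration only, not the authors' evaluation code: the function names and the toy rankings are assumptions, and each ranking is represented as a list of booleans in rank order, where `True` marks a document that contains curatable content.

```python
from statistics import mean


def average_precision(ranked_relevance):
    """Average precision for one ranked list.

    ranked_relevance: booleans in rank order (True = relevant document).
    Precision is sampled at every rank where a relevant document appears.
    """
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    # A ranking with no relevant documents is scored as 0.0 by convention here.
    return mean(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over several ranked lists."""
    return mean(average_precision(r) for r in rankings)


# Hypothetical toy rankings, not data from the paper.
baseline = [[False, True, True], [False, True]]
reranked = [[True, True, False], [True, False]]

print(mean_average_precision(baseline))
print(mean_average_precision(reranked))
```

A re-ranking that moves relevant documents toward the top of each list raises the score, which is the kind of improvement the reported 63% vs. 73% comparison reflects.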

K. Bretonnel Cohen | Thomas C. Wiegers | Allan Peter Davis | Carolyn J. Mattingly | Lynette Hirschman
