The curation paradigm and application tool used for manual curation of the scientific literature at the Comparative Toxicogenomics Database

The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and convert free-text information into a structured format using official nomenclature, integrating third party controlled vocabularies for chemicals, genes, diseases and organisms, and a novel controlled vocabulary for molecular interactions. Manual curation produces a robust, richly annotated dataset of highly accurate and detailed information. Currently, CTD describes over 349 000 molecular interactions between 6800 chemicals, 20 900 genes (for 330 organisms) and 4300 diseases that have been manually curated from over 25 400 peer-reviewed articles. This manually curated data are further integrated with other third party data (e.g. Gene Ontology, KEGG and Reactome annotations) to generate a wealth of toxicogenomic relationships. Here, we describe our approach to manual curation that uses a powerful and efficient paradigm involving mnemonic codes. This strategy allows biocurators to quickly capture detailed information from articles by generating simple statements using codes to represent the relationships between data types. The paradigm is versatile, expandable, and able to accommodate new data challenges that arise. We have incorporated this strategy into a web-based curation tool to further increase efficiency and productivity, implement quality control in real-time and accommodate biocurators working remotely. Database URL: http://ctd.mdibl.org

[1]  Allan Peter Davis,et al.  Genetic and environmental pathways to complex diseases , 2009, BMC Systems Biology.

[2]  E. F. CODD,et al.  A relational model of data for large shared data banks , 1970, CACM.

[3]  Christopher G. Chute,et al.  BioPortal: ontologies and integrated data resources at the click of a mouse , 2009, Nucleic Acids Res..

[4]  Thomas C. Wiegers,et al.  The Comparative Toxicogenomics Database: update 2011 , 2010, Nucleic Acids Res..

[5]  Judith A. Blake,et al.  Integrating text mining into the MGI biocuration workflow , 2009, Database J. Biol. Databases Curation.

[6]  Carol A. Bocchini,et al.  A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®) , 2011, Human mutation.

[7]  Lincoln Stein,et al.  Reactome: a database of reactions, pathways and biological processes , 2010, Nucleic Acids Res..

[8]  Xudong Wu,et al.  Preferential regulation of miRNA targets by environmental chemicals in the human genome , 2011, BMC Genomics.

[9]  Thomas C. Wiegers,et al.  Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical–gene–disease networks , 2008, Nucleic Acids Res..

[10]  R. Judson,et al.  The Toxicity Data Landscape for Environmental Chemicals , 2008, Environmental health perspectives.

[11]  P. Landrigan,et al.  Chemical wastes, children's health, and the Superfund Basic Research Program. , 1999, Environmental health perspectives.

[12]  Thomas C. Wiegers,et al.  The Comparative Toxicogenomics Database: update 2013 , 2012, Nucleic Acids Res..

[13]  David M. Reif,et al.  Activity profiles of 309 ToxCast™ chemicals evaluated across 292 biochemical targets. , 2011, Toxicology.

[14]  Michael C. Rosenstein,et al.  The comparative toxicogenomics database: a cross-species resource for building chemical-gene interaction networks. , 2006, Toxicological sciences : an official journal of the Society of Toxicology.

[15]  Thomas C. Wiegers,et al.  The Comparative Toxicogenomics Database facilitates identification and understanding of chemical-gene-disease associations: arsenic as a case study , 2008, BMC Medical Genomics.

[16]  E. F. Codd,et al.  A relational model of data for large shared data banks , 1970, CACM.

[17]  Weisong Liu,et al.  The Rat Genome Database curation tool suite: a set of optimized software tools enabling efficient acquisition, organization, and presentation of biological data , 2011, Database J. Biol. Databases Curation.

[18]  Thomas C. Wiegers,et al.  GeneComps and ChemComps: a new CTD metric to identify genes and chemicals with shared toxicogenomic profiles , 2009, Bioinformation.

[19]  Howard L. Bleich,et al.  Technical Milestone: Medical Subject Headings Used to Search the Biomedical Literature , 2001, J. Am. Medical Informatics Assoc..

[20]  Gregory D. Schuler,et al.  Database resources of the National Center for Biotechnology Information: update , 2004, Nucleic acids research.

[21]  K. Bretonnel Cohen,et al.  Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD) , 2009, BMC Bioinformatics.

[22]  Yoshihiro Yamanishi,et al.  KEGG for linking genomes to life and the environment , 2007, Nucleic Acids Res..

[23]  Paul Davis,et al.  Methods and strategies for gene structure curation in WormBase , 2011, Database J. Biol. Databases Curation.

[24]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[25]  Judith A. Blake,et al.  Integrating Text Mining into the MGI Biocuration Workflow , 2009 .