Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes

The last 20 years of advancement in DNA sequencing technologies have led to the sequencing of thousands of microbial genomes, creating mountains of genetic data. While our efficiency in generating the data improves almost daily, establishing meaningful relationships between the taxonomic and genetic entities requires a new approach. Currently, the knowledge is distributed across a fragmented landscape of resources, from government-funded institutions such as NCBI and UniProt, to topic-focused databases like the ODB3 database of prokaryotic operons, to the supplemental table of a primary publication. A major drawback of large-scale, expert-curated databases is the expense of maintaining and extending them over time. No entity apart from a major institution with stable long-term funding can consider this, and even such institutions are limited in scope given the magnitude of microbial data being generated daily. Wikidata is an openly editable, semantic web compatible framework for knowledge representation. It is a project of the Wikimedia Foundation and offers knowledge integration capabilities ideally suited to the challenge of representing the exploding body of information about microbial genomics. We are developing a microbial-specific data model, based on Wikidata's semantic web compatibility, that represents bacterial species, strains, and the genes and gene products that define them. Currently, we have loaded 1736 gene items and 1741 protein items for two strains of the human pathogenic bacterium Chlamydia trachomatis and used this subset of data as an example of the empowering utility of this model. In our next phase of development we will expand by adding another 118 bacterial genomes and their genes and gene products, totaling over 900,000 additional entities. This aggregation of knowledge will be a platform for community-driven collaboration, allowing the networking of microbial genetic data through the sharing of knowledge by both data and domain experts.
