Wikidata as a semantic framework for the Gene Wiki initiative

Open biological data are distributed over many resources, which makes them challenging to integrate, update and disseminate quickly. Wikidata is a growing, open community database that can serve this purpose and that also provides tight integration with Wikipedia. To improve the state of biological data and to facilitate data management and dissemination, we imported all human and mouse genes and all human and mouse proteins into Wikidata. In total, 59,530 human genes and 73,130 mouse genes were imported from NCBI, and 27,662 human proteins and 16,728 mouse proteins were imported from the Swiss-Prot subset of UniProt. Because Wikidata is open and can be edited by anybody, our corpus of imported data serves as the starting point for the integration of further data by scientists, the Wikidata community and citizen scientists alike. The first use case for these data is to populate Wikipedia Gene Wiki infoboxes directly from Wikidata, so that the infoboxes update as soon as the underlying data in Wikidata are modified. Although Gene Wiki pages currently exist only on the English-language version of Wikipedia, the multilingual nature of Wikidata allows the imported data to be used in all 280 language editions of Wikipedia. Beyond the Gene Wiki infobox use case, a powerful SPARQL endpoint and up-to-date export functionality (e.g. JSON, XML) make further use of the data convenient for scientists. In summary, we created a fully open and extensible data resource for human and mouse molecular biology and biochemistry data. This resource enriches all Wikipedias with structured information and serves as a new linking hub for the biological semantic web.
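As an illustration of the SPARQL and JSON access mentioned in the abstract, the following Python sketch builds a query for human genes that carry an NCBI gene identifier and flattens a JSON result into plain records. Nothing in it is taken from the paper itself: the endpoint URL (the public Wikidata Query Service), the identifiers P351 (Entrez Gene ID), P703 (found in taxon) and Q15978631 (Homo sapiens), and all function names are assumptions made for the example, and the identifiers should be checked against Wikidata before use.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint (not named in the abstract).
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(limit=5):
    """Return a SPARQL query for human genes that have an Entrez Gene ID.

    Assumed Wikidata identifiers: P351 (Entrez Gene ID),
    P703 (found in taxon), Q15978631 (Homo sapiens).
    """
    return (
        "SELECT ?gene ?geneLabel ?entrez WHERE {\n"
        "  ?gene wdt:P351 ?entrez ;\n"
        "        wdt:P703 wd:Q15978631 .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        f"}} LIMIT {int(limit)}"
    )


def build_url(query):
    """Encode the query as a GET URL that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return ENDPOINT + "?" + params


def parse_bindings(payload):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in payload["results"]["bindings"]
    ]


def fetch(query):
    """Run the query over HTTP (needs network access; not called here)."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "gene-wiki-example/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch(build_query(limit=5))` would, under these assumptions, return a short list of dictionaries with the keys `gene`, `geneLabel` and `entrez`. Keeping `parse_bindings` separate from the HTTP call means the result handling can be checked offline against any document in the standard SPARQL JSON results layout.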
