Catching Inconsistencies with the Semantic Web: A Biocuration Case Study

Background: The UniProtKB/Swiss-Prot database is manually curated by a team of experienced biocurators with the aim of providing the scientific community with high-quality information on proteins. Maintaining a high curation standard depends in part on effective tools that help curators avoid trivial mistakes during data curation.

Description: We describe a system that uses SPARQL queries encoded in SPIN to identify UniProtKB database records that do not comply with manual curation rules. To generate specific and accurate warnings for curators, the system must correctly define known exceptions to general rules.

Conclusions: Semantic web technologies such as SPARQL queries are a good way to encode quality control rules for manual curation efforts in the life sciences because they are simple and cheap to maintain. This is an important factor in the face of continuously growing and evolving knowledge about biology. The results of SPARQL queries can be presented in a user-friendly way to help curators with data correction.
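To make the idea of a curation rule with known exceptions concrete, the following is a minimal sketch in plain Python rather than SPARQL or SPIN. It is not the system described here: the rule ("an entry typed as an enzyme should carry a catalytic activity annotation"), the predicate names, and the entry identifiers are all hypothetical and chosen only for illustration. In SPARQL, the same pattern would typically be expressed with `FILTER NOT EXISTS` clauses for the missing annotation and for the exception.

```python
# Hypothetical (subject, predicate, object) triples; not real UniProtKB data.
TRIPLES = {
    ("P1", "type", "Enzyme"),
    ("P1", "hasCatalyticActivity", "reaction-1"),
    ("P2", "type", "Enzyme"),
    ("P3", "type", "Enzyme"),
    ("P3", "knownException", "inactive-homolog"),
    ("P4", "type", "Transporter"),
}


def subjects_with(triples, predicate, obj=None):
    """Return all subjects that have the given predicate (and object, if set)."""
    return {
        s for (s, p, o) in triples
        if p == predicate and (obj is None or o == obj)
    }


def check_enzyme_rule(triples):
    """Flag enzymes lacking a catalytic activity annotation.

    Entries marked as known exceptions are skipped, so that curators
    only see warnings that actually call for a correction.
    """
    enzymes = subjects_with(triples, "type", "Enzyme")
    annotated = subjects_with(triples, "hasCatalyticActivity")
    exceptions = subjects_with(triples, "knownException")
    violating = enzymes - annotated - exceptions
    return [
        f"{entry}: enzyme without catalytic activity annotation"
        for entry in sorted(violating)
    ]


warnings = check_enzyme_rule(TRIPLES)
for warning in warnings:
    print(warning)
```

In this toy data set only `P2` is reported: `P1` satisfies the rule, `P3` is covered by the exception, and `P4` is outside the rule's scope. Keeping the exception as data rather than hard-coding it into the check mirrors the point made above that exceptions must be defined explicitly for warnings to stay specific.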
