Protein classification using ontology classification

MOTIVATION: The classification of proteins expressed by an organism is an important step in understanding the molecular biology of that organism. Traditionally, this classification has been performed by human experts, who can recognise the functional properties that are sufficient to place an individual gene product into a particular protein family. Automation of this task usually fails to meet the 'gold standard' of the human annotator because the recognition step is difficult to automate. The growing number of genomes, the rapid changes in knowledge and the central role of classification in the annotation process, however, motivate the need to automate this process.

RESULTS: We capture, as an ontology, human understanding of how to recognise members of the protein phosphatase family by their domain architecture. By describing protein instances in terms of the domains they contain, it is possible to use description logic reasoners and our ontology to assign those proteins to a protein family class. We have tested our system on classifying the protein phosphatases of the human and Aspergillus fumigatus genomes and found that our knowledge-based, automatic classification matches, and sometimes surpasses, that of the human annotators. We have made the classification process fast and reproducible and, where appropriate knowledge is available, the method can potentially be generalised for use with any protein family.

AVAILABILITY: All components described in this paper are freely available. OWL ontology: http://www.bioinf.man.ac.uk/phosphabase; myGrid: http://www.mygrid.org.uk; Instance Store: http://instancestore.man.ac.uk
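The idea in RESULTS, assigning a protein to a family class from the domains it contains, can be illustrated with a minimal sketch. This is not the paper's OWL ontology or a description logic reasoner: it is a plain Python stand-in that checks minimum domain counts and keeps the most specific matching class. All class names, domain names and count thresholds below are hypothetical placeholders chosen for illustration.

```python
from collections import Counter

# Hypothetical class definitions (NOT taken from the paper's ontology).
# Each entry maps a class name to (parent class, minimum count per domain).
CLASSES = {
    "ProteinPhosphatase": (None, {"catalytic": 1}),
    "ReceptorTyrosinePhosphatase": (
        "ProteinPhosphatase",
        {"catalytic": 1, "transmembrane": 1},
    ),
    "TandemReceptorTyrosinePhosphatase": (
        "ReceptorTyrosinePhosphatase",
        {"catalytic": 2, "transmembrane": 1},
    ),
}


def matches(domains, required):
    """Return True if the protein has at least the required count of each domain."""
    counts = Counter(domains)
    return all(counts[name] >= minimum for name, minimum in required.items())


def classify(domains):
    """Return the most specific class(es) whose domain requirements are met."""
    matched = {
        name for name, (_parent, required) in CLASSES.items()
        if matches(domains, required)
    }
    # Drop any matched class that is the direct parent of another matched class,
    # so that only the most specific assignments remain.
    parents_of_matched = {CLASSES[name][0] for name in matched}
    return sorted(matched - parents_of_matched)


if __name__ == "__main__":
    print(classify(["catalytic"]))
    print(classify(["transmembrane", "catalytic", "catalytic"]))
    print(classify(["kinase"]))
```

A real description logic reasoner infers the class hierarchy from the definitions themselves, whereas this sketch takes the parent links as given; it is meant only to show how a domain-composition description can drive class assignment.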
