Predicting pathway membership via domain signatures

Motivation: Functional characterization of genes is central to understanding complex cellular processes. Pathway databases such as KEGG provide valuable information for this purpose. However, only a small fraction of genes has been annotated with pathway information so far. In contrast, information on the protein domains a gene product contains is available for a considerably larger number of genes, e.g. from the InterPro database.

Results: We present a classification model that predicts, for a given gene of interest, its mapping to a KEGG pathway based on its domain signature. The classifier makes explicit use of the hierarchical organization of pathways in the KEGG database. Furthermore, it takes into account that a gene can be mapped to several pathways at the same time. The classification method produces a score for every possible mapping position of the gene in the KEGG hierarchy. Evaluations of our model, which combines an SVM with a ranking perceptron approach, show high prediction performance. Moreover, for signaling pathways we show that it is even possible to accurately predict membership in individual pathway components.

Availability: The R package gene2pathway is a supplement to this article.

Contact: h.froehlich@dkfz-heidelberg.de

Supplementary Information: Supplementary data are available at Bioinformatics online.
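To make the data representation behind such a classifier more concrete, here is a minimal Python sketch. It is not the gene2pathway implementation. It assumes that a domain signature can be encoded as a binary vector over a fixed domain vocabulary, and that a gene mapped to a pathway is also treated as a member of every ancestor of that pathway in the hierarchy. The function names, the domain identifiers (DOM1 to DOM3) and the three-level toy hierarchy are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Optional, Set

# Toy pathway hierarchy (assumption, for illustration only): child -> parent.
# Top-level nodes have parent None.
PARENT: Dict[str, Optional[str]] = {
    "Metabolism": None,
    "Carbohydrate metabolism": "Metabolism",
    "Glycolysis / Gluconeogenesis": "Carbohydrate metabolism",
    "Environmental Information Processing": None,
    "Signal transduction": "Environmental Information Processing",
    "MAPK signaling pathway": "Signal transduction",
}


def encode_signature(domains: Iterable[str], vocabulary: List[str]) -> List[int]:
    """Binary vector: 1 if the gene product contains the domain, else 0."""
    present = set(domains)
    return [1 if d in present else 0 for d in vocabulary]


def expand_labels(pathways: Iterable[str],
                  parent: Dict[str, Optional[str]]) -> Set[str]:
    """Add all ancestors of each annotated pathway (multi-label, hierarchical)."""
    labels: Set[str] = set()
    for node in pathways:
        current: Optional[str] = node
        while current is not None:
            labels.add(current)
            current = parent[current]
    return labels


def label_vector(pathways: Iterable[str],
                 parent: Dict[str, Optional[str]]) -> Dict[str, int]:
    """0/1 target for every node in the hierarchy, one entry per position."""
    active = expand_labels(pathways, parent)
    return {node: (1 if node in active else 0) for node in sorted(parent)}


if __name__ == "__main__":
    vocabulary = ["DOM1", "DOM2", "DOM3"]  # hypothetical domain identifiers
    x = encode_signature(["DOM2"], vocabulary)
    y = label_vector(["MAPK signaling pathway", "Glycolysis / Gluconeogenesis"], PARENT)
    print(x)
    print(y)
```

A gene annotated with two pathways in different branches activates both branches at once, which is the multi-label aspect mentioned in the abstract. A classifier trained on such pairs would then assign a score to every node of the hierarchy rather than a single class label.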
