Extending pathways based on gene lists using InterPro domain signatures

Background: High-throughput technologies like functional screens and gene expression analysis produce extended lists of candidate genes. Gene-Set Enrichment Analysis is a commonly used and well-established technique to test for the statistically significant over-representation of particular pathways. A shortcoming of this method, however, is that most genes investigated in these experiments have very sparse functional or pathway annotation and therefore cannot be the target of such an analysis. The approach presented here aims to assign lists of genes with limited annotation to previously described functional gene collections or pathways. It works by comparing InterPro domain signatures of the candidate gene lists with domain signatures of gene sets derived from known classifications, e.g. KEGG pathways.

Results: To validate our approach, we designed a simulation study. Based on all pathways available in the KEGG database, we created test gene lists by randomly selecting pathway genes, removing these genes from the known pathways, and adding variable amounts of noise in the form of genes not annotated to the pathway. We show that pathway memberships can be recovered from the simulated gene lists with high accuracy. We further demonstrate the applicability of our approach on a biological example.

Conclusion: Results based on simulation and data analysis show that domain-based pathway enrichment analysis is a very sensitive method to test for enrichment of pathways in sparsely annotated lists of genes. An R-based software package, domainsignatures, to routinely perform this analysis on the results of high-throughput screening, is available via Bioconductor.
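The comparison of domain signatures described above can be illustrated with a small sketch. This is not the domainsignatures implementation: the abstract does not state which similarity measure or null model the package uses, so cosine similarity and a resampling test against random gene lists are assumptions made here for illustration. All gene names, the placeholder domain identifiers (IPR_A, IPR_B, ...) and the function names are hypothetical.

```python
import random

# Toy gene -> InterPro domain mapping (all identifiers are made up).
GENE_DOMAINS = {
    "g1": {"IPR_A", "IPR_B"},
    "g2": {"IPR_A"},
    "g3": {"IPR_B", "IPR_C"},
    "g4": {"IPR_X"},
    "g5": {"IPR_Y"},
    "g6": {"IPR_X", "IPR_Y"},
    "u1": {"IPR_A", "IPR_B"},  # sparsely annotated candidate gene
    "u2": {"IPR_A"},           # sparsely annotated candidate gene
}


def domain_signature(genes, gene_domains):
    """Map each domain to the fraction of genes in `genes` that carry it."""
    genes = list(genes)
    if not genes:
        return {}
    counts = {}
    for gene in genes:
        for domain in gene_domains.get(gene, ()):
            counts[domain] = counts.get(domain, 0) + 1
    return {domain: c / len(genes) for domain, c in counts.items()}


def cosine(sig_a, sig_b):
    """Cosine similarity of two sparse signatures (assumed measure)."""
    dot = sum(v * sig_b.get(k, 0.0) for k, v in sig_a.items())
    norm_a = sum(v * v for v in sig_a.values()) ** 0.5
    norm_b = sum(v * v for v in sig_b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enrichment_p(gene_list, pathway_genes, gene_domains, universe,
                 n_perm=1000, seed=0):
    """Observed similarity and an empirical p-value from random gene lists."""
    rng = random.Random(seed)
    pathway_sig = domain_signature(pathway_genes, gene_domains)
    observed = cosine(domain_signature(gene_list, gene_domains), pathway_sig)
    hits = 0
    for _ in range(n_perm):
        # Null model (assumption): random lists of the same size.
        sample = rng.sample(universe, len(gene_list))
        if cosine(domain_signature(sample, gene_domains), pathway_sig) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


pathway = ["g1", "g2", "g3"]
candidates = ["u1", "u2"]
universe = sorted(GENE_DOMAINS)
similarity, p_value = enrichment_p(candidates, pathway, GENE_DOMAINS, universe)
print(f"similarity={similarity:.3f}, empirical p={p_value:.3f}")
```

With only eight toy genes the empirical p-value is not meaningful; in practice the universe would be all genes with InterPro annotation, and the test would be repeated for every pathway with a correction for multiple testing.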
