Statistical analysis of genomic protein family and domain controlled annotations for functional investigation of classified gene lists

Background
Protein family and domain annotations are growing in number and constitute important information for understanding protein functions and gaining insight into relations among the genes that encode them. To enable the analysis of gene proteomic annotations, we implemented novel modules within GFINDer, a Web system we previously developed that dynamically aggregates functional and phenotypic annotations of user-uploaded gene lists and supports their statistical analysis and mining.

Results
Exploiting protein information in the Pfam and InterPro databanks, we developed and added to GFINDer original modules specifically devoted to the exploration and analysis of functional signatures of gene protein products. They allow annotating numerous user-classified nucleotide sequence identifiers with controlled information on related protein families, domains and functional sites, classifying them according to such protein annotation categories, and statistically analyzing the obtained classifications. In particular, when the uploaded nucleotide sequence identifiers are subdivided into classes, the Statistics Protein Families&Domains module estimates the relevance of Pfam or InterPro controlled annotations for the uploaded genes by highlighting protein signatures that are significantly over-represented within user-defined classes of genes. In addition, the Logistic Regression module identifies the protein functional signatures that best explain the considered gene classification.

Conclusion
The novel GFINDer modules provide genomic protein family and domain analyses that support better functional interpretation of gene classes, for instance those defined through statistical and clustering analyses of gene expression results from microarray experiments. They can hence help in understanding fundamental biological processes and complex cellular mechanisms influenced by protein domain composition, and contribute to unveiling new biomedical knowledge about the encoding genes.
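As an illustration of what "significantly more represented within a class" can mean in practice, the following minimal Python sketch scores each protein signature with a one-sided hypergeometric test and then adjusts the p-values with the Benjamini-Hochberg procedure. This is an assumption made for illustration only: the abstract does not state which tests or corrections GFINDer applies, and the function names, gene identifiers (`g1` to `g8`) and class labels are hypothetical toy values, not GFINDer data or its API.

```python
from math import comb


def hypergeom_upper_tail(k, n_class, n_annotated, n_total):
    """P(X >= k): chance of seeing at least k annotated genes in a class of
    n_class genes drawn from n_total genes, n_annotated of which carry the
    signature (one-sided hypergeometric test)."""
    upper = min(n_class, n_annotated)
    favourable = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, n_class - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_class)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enriched_signatures(gene_class, gene_signatures, target_class):
    """Map each signature to (raw p-value, adjusted p-value) for target_class.

    gene_class:      dict gene id -> class label (hypothetical input format)
    gene_signatures: dict gene id -> set of signature ids (e.g. Pfam accessions)
    """
    n_total = len(gene_class)
    in_class = {g for g, c in gene_class.items() if c == target_class}
    signatures = sorted({s for g in gene_class for s in gene_signatures.get(g, ())})
    raw = []
    for sig in signatures:
        carriers = {g for g in gene_class if sig in gene_signatures.get(g, ())}
        raw.append(
            hypergeom_upper_tail(
                len(carriers & in_class), len(in_class), len(carriers), n_total
            )
        )
    return dict(zip(signatures, zip(raw, benjamini_hochberg(raw))))


# Toy data: four genes per class; PF00069 occurs only in the "up" class.
toy_classes = {f"g{i}": ("up" if i <= 4 else "down") for i in range(1, 9)}
toy_signatures = {
    "g1": {"PF00069", "PF00001"}, "g2": {"PF00069"}, "g3": {"PF00069"},
    "g4": {"PF00069"}, "g5": {"PF00001"}, "g6": set(), "g7": set(), "g8": set(),
}
results = enriched_signatures(toy_classes, toy_signatures, "up")
```

In this toy setting, `results["PF00069"]` holds a raw p-value of 1/70, because only one of the 70 possible four-gene classes contains all four carriers. A real analysis would involve far more genes and signatures, which is where the multiple-testing adjustment matters.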
