Annotating Genes of Known and Unknown Function by Large-Scale Co-Expression Analysis

About 40% of the proteins encoded in eukaryotic genomes are proteins of unknown function (PUFs). Their functional characterization remains one of the main challenges in modern biology. In this study, we identified the PUF-encoding genes of Arabidopsis thaliana using a combination of sequence similarity, domain-based and empirical approaches. Large-scale gene expression analyses of 1,310 publicly available Affymetrix chips were performed to associate the identified PUF genes with regulatory networks and biological processes of known function. To ensure reliable results, the study was restricted to expression sets with replicated samples. First, genome-wide clustering and gene function enrichment analysis of the resulting clusters allowed us to associate 1,541 PUF genes with tightly coexpressed genes encoding proteins of known function (PKFs). Over 70% of these PUF genes could be assigned to more specific Biological Process annotations than those available in the current Gene Ontology release. The most highly over-represented functional categories in the obtained clusters were ribosome assembly, photosynthesis and cell wall pathways. Interestingly, the majority of the PUF genes appeared to be controlled by the same regulatory networks as most PKF genes, because clusters enriched in PUF genes were extremely rare. Second, large-scale analysis of differentially expressed genes (DEGs) was applied to identify a comprehensive set of abiotic stress response genes. This analysis identified 269 PKF and 104 PUF genes that responded to a wide variety of abiotic stresses, while 608 PKF and 206 PUF genes responded predominantly to specific stress treatments. The provided coexpression and DEG data represent an important resource for guiding future functional characterization experiments of PUF and PKF genes.
Finally, the public Plant Gene Expression Database (PED, URL: http://bioweb.ucr.edu/PED) was developed as part of this project to provide efficient access and mining tools for the vast gene expression data of this study.
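The gene function enrichment step mentioned in the abstract can be illustrated with a one-sided hypergeometric test, a common choice for asking whether an annotation term is over-represented in a coexpression cluster. The abstract does not state which test was used, so the sketch below is an assumption for illustration only; the function name `enrichment_p_value` and all gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population   -- number of genes on the array (assumed background)
    annotated    -- genes in the background carrying the term
    cluster_size -- genes in the coexpression cluster
    hits         -- genes in the cluster carrying the term
    """
    max_hits = min(annotated, cluster_size)
    # Number of ways to draw the cluster with exactly k annotated genes,
    # summed over k = hits .. max_hits.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, max_hits + 1)
    )
    # Divide by the number of ways to draw any cluster of this size.
    return tail / comb(population, cluster_size)


if __name__ == "__main__":
    # Hypothetical counts: 22,000 background genes, 300 annotated with a
    # term, and a 50-gene cluster containing 8 annotated genes.
    p = enrichment_p_value(22000, 300, 50, 8)
    print(f"enrichment p-value: {p:.3e}")
```

In a genome-wide setting, p-values computed this way for many clusters and terms would normally be corrected for multiple testing before any term is called over-represented.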