Theme discovery from gene lists for identification and viewing of multiple functional groups

Background: High-throughput methods of the genome era produce vast amounts of data in the form of gene lists. These lists are large and difficult to interpret without advanced computational or bioinformatic tools. Most existing methods analyse a gene list as a single entity, although it is composed of multiple gene groups associated with separate biological functions. It is therefore important to define and visualize the gene groups with distinct functionality within a gene list.

Results: To analyse the functional heterogeneity within a gene list, we have developed a method that clusters genes into groups with homogeneous functionality. The method uses Non-negative Matrix Factorization (NMF) to create several clustering results with varying numbers of clusters. The obtained clustering results are combined into a simple graphical presentation showing the functional groups over-represented in the analysed gene list. We demonstrate its performance on two data sets and show results that improve upon existing methods. The comparison also shows that our method creates a simpler view that aids the discovery of biological themes within the list and discards less informative classes from the results.

Conclusion: The presented method and associated software are useful for the identification and interpretation of biological functions associated with gene lists, and are especially useful for the analysis of large lists.
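The clustering step described in the Results can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions: the input is taken to be a binary gene-by-functional-class annotation matrix, the factorization uses the standard multiplicative updates for the Frobenius objective, each gene is assigned to the factor with its largest coefficient, and the names `nmf`, `cluster_genes`, and the toy matrix `V` are hypothetical.

```python
import numpy as np


def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Factorize non-negative V (genes x classes) as W @ H with k factors.

    Uses multiplicative updates for the squared-error objective; the
    abstract does not say which NMF variant the method uses.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_classes = V.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_classes)) + eps
    for _ in range(n_iter):
        # eps keeps the denominators strictly positive.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def cluster_genes(V, ks, **kwargs):
    """Return {k: cluster label per gene} for each requested cluster count."""
    return {k: nmf(V, k, **kwargs)[0].argmax(axis=1) for k in ks}


# Toy annotation matrix (assumed layout): rows are genes, columns are
# functional classes, 1 means the gene is annotated to the class.
V = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# One clustering per cluster count, as in "varying numbers of clusters".
labels = cluster_genes(V, ks=(2, 3))
for k, assignment in labels.items():
    print(k, assignment.tolist())
```

With two factors, the first three genes and the last three genes in this toy matrix fall into separate clusters. How the clusterings for different cluster counts are combined into the graphical presentation is not specified in the abstract and is left out of the sketch.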
