GeneXPress: A Visualization and Statistical Analysis Tool for Gene Expression and Sequence Data

Many algorithms have been developed for analyzing gene expression and sequence data. However, to extract biological understanding, scientists often have to perform further time-consuming post-processing on the output of these algorithms. In this paper, we present GeneXPress, a tool designed to facilitate the assignment of biological meaning to gene expression patterns by automating this post-processing stage. Within a few simple steps that take at most several minutes, a user of GeneXPress can: identify the biological processes represented by each cluster; identify the DNA binding sites that are unique to the genes in each cluster; and examine multiple visualizations of the expression and sequence data. GeneXPress thus allows the researcher to quickly identify potentially new biological discoveries. GeneXPress is available for download at http://GeneXPress.stanford.edu.

Contact: eran@cs.stanford.edu

MOTIVATION

The availability of complete genomic sequences and genome-wide measurements of gene expression provides us with the means to understand cellular processes and their regulation on a genome-wide scale. Indeed, much recent work has been devoted to analyzing these data for this purpose. The most common method for analyzing gene expression data is clustering (e.g., (1)), which groups together genes with similar expression profiles. Genes that are similarly expressed often participate in the same cellular processes, so clustering suggests functional relationships between the clustered genes. Similarly, we expect co-clustered genes to be co-regulated by the same cis-regulatory mechanism, which can be revealed by searching for commonly occurring motifs in the promoter regions of the genes in a cluster (e.g., (5)). The outputs of clustering and motif-finding algorithms provide the basis for understanding the biological story underlying the data.
However, to extract concrete biological understanding, further time-consuming post-analysis of such outputs is required. It is not rare for the analysis and post-analysis stages of a gene expression experiment to take several months of intensive manual work. This post-analysis usually focuses on relating gene expression patterns to other forms of biological knowledge. During the analysis, questions such as the following need to be answered: what biological processes are represented by each cluster; what cis-regulatory motifs are shared by genes within a cluster; and how significant these associations are. This type of analysis requires a multitude of scripts, visualizations, and comparisons to multiple biological databases. Currently, this work is duplicated many times, both within and between labs.

In this paper, we present GeneXPress, a general-purpose visualization and analysis tool that is designed to support extensive post-analysis of gene expression experiments. GeneXPress contains a suite of tools that automatically answer questions such as the ones described above, through visual and statistical analysis of the outputs of clustering and motif-finding algorithms. GeneXPress has several different visualizations that allow both global and detailed views of expression profiles, promoter regions, and motifs. Through statistical analysis of the clusters relative to databases of gene annotations (e.g., GO, http://geneontology.org), GeneXPress can associate each cluster with one or more biological processes. Through similar analysis for motifs, GeneXPress can identify the motifs that are present in the promoter regions of the genes in each cluster. The statistical significance of each discovered association is quantified by an automatically computed p-value. GeneXPress uses simple and extensible XML-based file formats. The output of clustering and motif-finding algorithms is easy to convert to these formats and use within GeneXPress.
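To make the idea of a p-value for a cluster-annotation association concrete, the following is a minimal sketch of a one-sided hypergeometric enrichment test. The choice of test is an assumption made for illustration: the text does not state which statistic GeneXPress computes, and the function name, parameter names, and example numbers below are all hypothetical.

```python
from math import comb


def enrichment_pvalue(population_size, annotated_in_population,
                      cluster_size, annotated_in_cluster):
    """One-sided hypergeometric tail probability.

    Returns P(X >= annotated_in_cluster), where X is the number of
    annotated genes in a random draw of `cluster_size` genes (without
    replacement) from a population of `population_size` genes, of which
    `annotated_in_population` carry the annotation.

    NOTE: this test is an illustrative assumption, not necessarily the
    statistic used by GeneXPress.
    """
    if not 0 <= annotated_in_cluster <= cluster_size <= population_size:
        raise ValueError("inconsistent cluster counts")
    if not 0 <= annotated_in_population <= population_size:
        raise ValueError("inconsistent annotation counts")

    # The overlap cannot exceed either the cluster or the annotated set.
    max_overlap = min(annotated_in_population, cluster_size)
    not_annotated = population_size - annotated_in_population

    # Exact integer arithmetic; math.comb returns 0 when k > n, so
    # impossible overlaps contribute nothing to the sum.
    tail = sum(
        comb(annotated_in_population, i) * comb(not_annotated, cluster_size - i)
        for i in range(annotated_in_cluster, max_overlap + 1)
    )
    return tail / comb(population_size, cluster_size)


if __name__ == "__main__":
    # Hypothetical numbers: 6000 genes in the genome, 100 of them annotated
    # with some process, and a cluster of 50 genes containing 10 of those.
    p = enrichment_pvalue(6000, 100, 50, 10)
    print(f"enrichment p-value: {p:.3e}")
```

A small p-value indicates that the cluster contains more annotated genes than would be expected from a random gene set of the same size. The same calculation applies to motifs if "annotated" is read as "has the motif in its promoter region".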
In addition, GeneXPress supports files generated for viewing with TreeView (http://rana.lbl.gov), the most commonly used software for visualizing expression data. GeneXPress implements all the views provided by TreeView and enhances them with additional convenient features. GeneXPress is freely available at http://GeneXPress.stanford.edu. The web site also provides sample files, detailed tutorials, and gene annotation and sequence motif files from existing databases that can be loaded into GeneXPress and used for analyzing