Discovering Transcriptional Modules by Combined Analysis of Expression Profiles and Regulatory Sequences

A key goal of gene expression analysis is the characterization of transcription factors (TFs) and micro-RNAs (miRNAs) regulating specific transcriptional programs. The most common approach to this task is a two-step methodology. In the first step, a clustering procedure partitions the genes into groups that are believed to be co-regulated, based on expression profile similarity. In the second step, a motif discovery tool searches for over-represented cis-regulatory motifs within each group. In an effort to obtain better results by simultaneously utilizing all available information, several studies have suggested computational schemes for a single-step combined analysis of expression and sequence data. Despite extensive research, reverse engineering complex regulatory networks from microarray measurements remains a difficult challenge with limited success, especially in metazoans. We present Allegro [1], a new method for de-novo discovery of TF and miRNA binding sites through joint analysis of genome-wide expression data and promoter or 3' UTR sequences. In brief, Allegro enumerates a huge number of candidate motifs in a series of refinement phases to converge to high-scoring motifs. For each candidate motif, it executes a cross-validation-like procedure to learn an expression model that describes the shared expression profile of the genes whose cis-regulatory sequences contain the motif. It then computes a p-value for the over-representation of the motif within the genes that best fit the expression profile. The output of Allegro is a non-redundant list of top-scoring motifs and the expression patterns they induce.
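The over-representation step described above can be illustrated with a small sketch. The text does not name the statistical test Allegro uses, so the hypergeometric tail below is an assumption chosen because it is a common choice for enrichment p-values; the function name and its parameters are hypothetical and not part of Allegro.

```python
from math import comb


def hypergeom_tail(population, motif_genes, selected, overlap):
    """Return P(X >= overlap) under a hypergeometric null (assumed test).

    population  -- total number of genes considered
    motif_genes -- genes whose cis-regulatory sequence contains the motif
    selected    -- genes that best fit the learned expression profile
    overlap     -- selected genes that also contain the motif
    """
    total = comb(population, selected)
    upper = min(motif_genes, selected)
    tail = sum(
        comb(motif_genes, i) * comb(population - motif_genes, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 6000 genes, 300 contain the motif,
# 200 fit the expression profile, 40 of those contain the motif.
p_value = hypergeom_tail(6000, 300, 200, 40)
```

A small value suggests that the motif occurs among the profile-matching genes more often than expected by chance. A real analysis would also need to correct for the very large number of candidate motifs tested.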
The expression model used by Allegro is a novel log-likelihood-based, non-parametric model, analogous to the position weight matrix commonly used for representing TF binding sites. Unlike most extant methods, our approach does not assume that the expression values follow a pre-defined type of distribution, and can capture transcriptional modules whose expression profiles differ from the rest of the genome across a small fraction of the conditions. Furthermore, it successfully handles cases where the expression levels are correlated with the length and GC-content of the cis-regulatory sequences. Such correlations are quite common in practice, and often bias existing techniques, leading to false predictions and low sensitivity. Allegro introduces several additional unique ideas and features, and is implemented in a graphical, user-friendly software tool. We apply it to several large datasets (>100 conditions) from mouse, fly and human, report on the transcriptional modules it uncovers, and show that it outperforms extant techniques. Allegro is available at http://acgt.cs.tau.ac.il/allegro.
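To make the analogy with a position weight matrix concrete, the following is a minimal sketch of a per-condition log-likelihood-ratio model. It is not Allegro's implementation. It assumes that expression values have already been discretized into three levels, and the level names, the pseudocount, and all function names are hypothetical.

```python
import math
from collections import Counter

# Assumed discretization of expression values (not taken from the source).
LEVELS = ("down", "flat", "up")


def level_frequencies(profiles, pseudocount=1.0):
    """Per-condition frequency of each level across the given profiles."""
    n_conditions = len(profiles[0])
    total = len(profiles) + pseudocount * len(LEVELS)
    freqs = []
    for c in range(n_conditions):
        counts = Counter(profile[c] for profile in profiles)
        freqs.append({lv: (counts[lv] + pseudocount) / total for lv in LEVELS})
    return freqs


def build_weight_matrix(target_profiles, background_profiles):
    """One row per condition: log(target frequency / background frequency)."""
    fg = level_frequencies(target_profiles)
    bg = level_frequencies(background_profiles)
    return [{lv: math.log(f[lv] / b[lv]) for lv in LEVELS}
            for f, b in zip(fg, bg)]


def score_profile(weights, profile):
    """Sum the log-likelihood ratios of a gene's levels over all conditions."""
    return sum(row[lv] for row, lv in zip(weights, profile))


# Toy data: three motif-containing genes and a ten-gene background.
targets = [("up", "flat"), ("up", "flat"), ("up", "down")]
background = targets + [("flat", "flat")] * 6 + [("down", "up")]
weights = build_weight_matrix(targets, background)
```

Because each condition contributes its own term, a gene set that departs from the background in only a few conditions can still receive a high score, and no parametric distribution over expression values is assumed.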