Computational approaches to identifying transcription factor binding sites in the yeast genome.

Publisher Summary: This chapter describes the basic ideas behind several algorithms for identifying transcription factor binding sites in the yeast genome. In the yeast genome, there are an estimated 500-600 transcription factors, whose targets can range from a few to a few hundred genes. Typically, the upstream region of a gene includes binding sites for different transcription factors that control the expression of the gene under different conditions. Thus, on a genomic scale, transcriptional regulation has enormous combinatorial complexity, and identifying regulatory sites is a very challenging problem. Several approaches have been developed to tackle it. One commonly used approach is to delineate as sharply as possible a group of coregulated genes and search for common sequence patterns in their upstream regulatory regions. The computational algorithms range from finding overrepresented substrings or regular expression patterns to multiple local sequence alignment. The prerequisite for this class of algorithms is a cleanly defined subset of genes that may share a few common motifs. The chapter describes a few representatives of this category. An alternative approach is to delineate combinatorial motifs from a large collection of regulatory sequences without the need to define coregulated groups. The chapter also describes two algorithms in this category. One is based on a mathematical model of probabilistic segmentation, a generalization of segmentation models used in statistical language processing. The other identifies sequence patterns in the promoter regions of genes that strongly correlate with genome-wide gene expression data.
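As a rough illustration of the simplest idea mentioned above, finding overrepresented substrings in the upstream regions of a coregulated gene group, the following Python sketch compares k-mer frequencies in a group against a background set. It is not the chapter's algorithm: the function names (`kmer_counts`, `enrichment`), the pseudocount smoothing, and the plain frequency ratio used as a score are all assumptions made for illustration. It also assumes the sequences contain only the letters A, C, G, and T.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping substring of length k across all sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enrichment(group, background, k, pseudocount=1.0):
    """Rank k-mers seen in `group` by their frequency ratio against `background`.

    Hypothetical scoring scheme: smoothed frequency in the group divided by
    smoothed frequency in the background. The pseudocount avoids division by
    zero for k-mers that never occur in the background.
    """
    g = kmer_counts(group, k)
    b = kmer_counts(background, k)
    g_total = sum(g.values())
    b_total = sum(b.values())
    n_kmers = 4 ** k  # number of possible k-mers over the alphabet A, C, G, T
    scores = {}
    for kmer, count in g.items():
        g_freq = (count + pseudocount) / (g_total + pseudocount * n_kmers)
        b_freq = (b[kmer] + pseudocount) / (b_total + pseudocount * n_kmers)
        scores[kmer] = g_freq / b_freq
    # Highest ratio first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only
    group = ["AAACGCGTAA", "TTACGCGTTT", "GGACGCGTCC"]
    background = ["ATATATATATAT", "GCTAGCTAGCTA", "TTTTAAAACCCC"]
    for kmer, score in enrichment(group, background, k=6)[:3]:
        print(kmer, round(score, 2))
```

A frequency ratio is only a crude stand-in for a proper statistical test, and a realistic analysis would also need to account for reverse complements and for degenerate positions within a motif, which fixed substrings cannot capture.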