Ab initio identification of putative human transcription factor binding sites by comparative genomics

Background

Understanding the transcriptional regulation of gene expression is one of the greatest challenges of modern molecular biology. A central role in this mechanism is played by transcription factors (TFs), which typically bind to specific, short DNA sequence motifs usually located in the upstream region of the regulated genes. We discuss here a simple and powerful approach for the ab initio identification of these cis-regulatory motifs. The method integrates three elements: human-mouse comparison, statistical analysis of genomic sequences, and the concept of coregulation. We apply it to a complete scan of the human genome.

Results

Using the catalogue of conserved upstream sequences collected in the CORG database, we construct sets of genes that share the same overrepresented motif (a short DNA sequence) in their upstream regions in both human and mouse. We perform this construction for all possible motifs from 5 to 8 nucleotides in length and then filter the resulting sets for two types of evidence of coregulation. First, we analyze the Gene Ontology annotation of the genes in each set, searching for statistically significant common annotations. Second, we analyze the expression profiles of the genes in each set, as measured by microarray experiments, searching for evidence of coexpression. The sets that pass one or both filters are conjectured to contain a significant fraction of coregulated genes, and the upstream motifs characterizing these sets are therefore good candidates to be the binding sites of the TFs involved in such regulation. In this way we find various known motifs and also some new candidate binding sites.

Conclusion

We have discussed a new integrated algorithm for the ab initio identification of transcription factor binding sites in the human genome. The method is based on three ingredients: comparative genomics, overrepresentation, and different types of coregulation. The method is applied to a full scan of the human genome, giving satisfactory results.
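The two core steps described above (building a gene set for a motif that is overrepresented in both species, then testing that set for a shared annotation) can be illustrated with a minimal Python sketch. The abstract does not specify the statistics actually used, so everything here is an assumption made for illustration: a binomial tail with a uniform background frequency for overrepresentation, a hypergeometric tail for annotation enrichment, the threshold `alpha`, and all function names (`motif_gene_set`, `overrepresented`, `hypergeom_tail`, and so on) are hypothetical.

```python
from math import comb


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def count_occurrences(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    m = len(motif)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == motif)


def overrepresented(seq, motif, background_freq, alpha):
    """Assumed test: is the motif count in seq unexpectedly high
    under a binomial model with the given background frequency?"""
    n = len(seq) - len(motif) + 1  # number of positions where the motif could start
    if n <= 0:
        return False
    k = count_occurrences(seq, motif)
    return k > 0 and binom_tail(n, k, background_freq) < alpha


def motif_gene_set(human, mouse, motif, background_freq, alpha=0.01):
    """Genes whose upstream region shows the motif overrepresented
    in BOTH the human and the mouse sequence.
    `human` and `mouse` map a gene identifier to its upstream sequence."""
    return {
        gene
        for gene in human
        if gene in mouse
        and overrepresented(human[gene], motif, background_freq, alpha)
        and overrepresented(mouse[gene], motif, background_freq, alpha)
    }


def hypergeom_tail(total, annotated, drawn, hits):
    """P(X >= hits) when `drawn` genes are sampled without replacement
    from `total` genes, of which `annotated` carry a given annotation."""
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(total - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(total, drawn)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    rich = "ACGTAACGTAACGTATTTT"   # three copies of ACGTA
    poor = "ACGTATTTTTTTTTTTTTT"   # a single copy
    human = {"g1": rich, "g2": rich, "g3": poor}
    mouse = {"g1": rich, "g2": poor, "g3": poor}
    uniform = 0.25 ** 5  # assumed uniform background for a 5-mer
    print(motif_gene_set(human, mouse, "ACGTA", uniform))
    print(hypergeom_tail(total=10, annotated=5, drawn=5, hits=5))
```

In a genome-wide scan the same construction would be repeated for every motif of length 5 to 8, so the resulting p-values would need a multiple-testing correction; the sketch omits this, as well as the coexpression filter based on microarray data.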
