Bioinformatics of eukaryotic gene regulation

Understanding the mechanisms that control gene expression is one of the fundamental problems of molecular biology. Detailed experimental studies of regulation are laborious because of the complex, combinatorial nature of the interactions among the molecules involved. Computational techniques are therefore used to suggest candidate mechanisms for further investigation. This thesis presents three methods that improve predictions of the regulation of gene transcription. The first finds binding sites recognized by a transcription factor based on the statistical over-representation of short motifs in a set of promoter sequences. A successful application of this method to several gene families of the yeast Saccharomyces cerevisiae is shown. More advanced techniques are needed to analyse gene regulation in higher eukaryotes. Libraries provide hundreds of profiles recognized by transcription factors. Dependencies between these profiles result in multiple predictions of the same binding sites, which must later be filtered out. The second method presented here therefore offers a way to reduce the number of profiles by identifying similarities between them. Still, the complex nature of the interactions between transcription factors makes reliable prediction of binding sites difficult. Exploiting independent sources of information reduces the rate of false predictions. The third method proposes a novel approach that associates gene annotations with regulation by multiple transcription factors and the binding sites they recognize. The utility of the method is demonstrated on several well-known sets of transcription factors.
Although the regulation of transcription is the major cellular mechanism for controlling gene expression, RNA interference provides an efficient way to down-regulate specific genes in experiments. Difficulties in predicting efficient siRNA sequences motivated the development of a library of siRNA sequences and related experimental details described in the literature. This library, presented in detail in the last chapter, is publicly available at http://www.human-sirna-database.net.
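
The idea behind the first method, statistical over-representation of short motifs in promoter sequences, can be illustrated with a minimal sketch. It is not the procedure used in the thesis: the function names (`count_kmers`, `overrepresented`), the word length, the pseudocount, and the binomial background model with a normal-approximation z-score are all assumptions chosen for brevity.

```python
from collections import Counter
from math import sqrt


def count_kmers(seqs, k):
    """Count every overlapping word of length k made only of A, C, G, T."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts


def overrepresented(promoters, background, k=6, pseudocount=1.0):
    """Rank k-mers by a z-score of observed vs. expected promoter counts.

    Expected counts come from word frequencies in the background set
    (assumed model: independent draws, i.e. a binomial approximation
    that ignores word overlap).
    """
    fg = count_kmers(promoters, k)
    bg = count_kmers(background, k)
    n_fg = sum(fg.values())
    n_bg = sum(bg.values())
    n_words = 4 ** k  # number of possible k-mers, used for smoothing

    scores = {}
    for word, observed in fg.items():
        # Smoothed background probability of this word.
        p = (bg[word] + pseudocount) / (n_bg + pseudocount * n_words)
        expected = n_fg * p
        sd = sqrt(n_fg * p * (1.0 - p))
        scores[word] = (observed - expected) / sd
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy data: the E-box-like word CACGTG is planted in every promoter.
    promoters = ["AAACACGTGAAA", "TTTCACGTGTTT", "GGGCACGTGCCC"]
    background = ["A" * 12 + "T" * 12 + "G" * 12 + "C" * 12]
    for word, z in overrepresented(promoters, background)[:3]:
        print(word, round(z, 1))
```

A realistic analysis would also need a background model that accounts for overlapping and self-overlapping words, both DNA strands, and multiple testing, all of which this sketch deliberately leaves out.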
