Predicting the in vivo signature of human gene regulatory sequence

MOTIVATION: In the living cell nucleus, genomic DNA is packaged into chromatin. DNA sequences that regulate transcription and other chromosomal processes are associated with local disruptions, or 'openings', in chromatin structure caused by the cooperative action of regulatory proteins. Such perturbations are extremely specific for cis-regulatory elements and occur over short stretches of DNA (typically about 250 bp). They can be detected experimentally in vivo as DNaseI hypersensitive sites (HSs), though the process is extremely laborious and costly. The ability to discriminate DNaseI HSs computationally would have a major impact on the annotation and utilization of the human genome.

RESULTS: We found that a supervised pattern recognition algorithm, trained on a set of 280 DNaseI HS and 737 non-HS control sequences from erythroid cells, was capable of de novo prediction of HSs across the human genome with surprisingly high accuracy, as determined by prospective in vivo validation. Systematic application of this computational approach will greatly facilitate the discovery and analysis of functional non-coding elements in the human and other complex genomes.

AVAILABILITY: Supplementary data are available at noble.gs.washington.edu/proj/hs
