A general integrative genomic feature transcription factor binding site prediction method applied to analysis of USF1 binding in cardiovascular disease

Transcription factors are key mediators of human complex disease processes. Identifying the target genes of transcription factors will increase our understanding of the biological network leading to disease risk. The prediction of transcription factor binding sites (TFBSs) is one method to identify these target genes; however, current prediction methods need improvement. We chose the transcription factor upstream stimulatory factor 1 (USF1) to evaluate the performance of our novel TFBS prediction method because of its known genetic association with coronary artery disease (CAD) and the recent availability of USF1 chromatin immunoprecipitation microarray (ChIP-chip) results. The specific goals of our study were to develop a novel and accurate genome-scale method for predicting USF1 binding sites and associated target genes to aid in the study of CAD. Previously published USF1 ChIP-chip data for 1 per cent of the genome were used to develop and evaluate several kernel logistic regression prediction models. A combination of genomic features (phylogenetic conservation, regulatory potential, presence of a CpG island and DNaseI hypersensitivity), as well as position weight matrix (PWM) scores, were used as variables for these models. Our most accurate predictor achieved an area under the receiver operating characteristic curve of 0.827 during cross-validation experiments, significantly outperforming standard PWM-based prediction methods. When applied to the whole human genome, we predicted 24,010 USF1 binding sites within 5 kilobases upstream of the transcription start sites of 9,721 genes. These predictions included 16 of 20 genes with strong evidence of USF1 regulation. Finally, in the spirit of genomic convergence, we integrated independent experimental CAD data with these USF1 binding site prediction results to develop a prioritised set of candidate genes for future CAD studies.
We have shown that our novel prediction method, which employs genomic features related to the presence of regulatory elements, enables more accurate and efficient prediction of USF1 binding sites. This method can be extended to other transcription factors identified in human disease studies to help further our understanding of the biology of complex disease.
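The PWM scores used as one of the model variables can be illustrated with a minimal sketch of log-odds scanning. This is not the study's implementation: the count matrix below is hypothetical (an illustrative 6-bp E-box-like motif, CACGTG), the uniform background frequency is an assumption, and the function names `build_pwm`, `scan` and `best_hit` are invented for this example. Only the forward strand is scanned.

```python
import math

# Hypothetical count matrix for a 6-bp E-box-like motif (CACGTG).
# The counts are illustrative only and are NOT taken from the study.
# Every cell is at least 1, so no log of zero can occur.
COUNTS = {
    "A": [1, 17, 1, 1, 1, 1],
    "C": [17, 1, 17, 1, 1, 1],
    "G": [1, 1, 1, 17, 1, 17],
    "T": [1, 1, 1, 1, 17, 1],
}
BACKGROUND = 0.25  # assumption: uniform background base frequency


def build_pwm(counts, background=BACKGROUND):
    """Convert per-position base counts into log2-odds weights."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[base][i] for base in "ACGT")
        pwm.append(
            {base: math.log2((counts[base][i] / total) / background) for base in "ACGT"}
        )
    return pwm


def scan(sequence, pwm):
    """Return (start, score) for every forward-strand window of the motif width."""
    width = len(pwm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


def best_hit(sequence, pwm):
    """Return the highest-scoring (start, score) window."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


if __name__ == "__main__":
    pwm = build_pwm(COUNTS)
    start, score = best_hit("TTTCACGTGAAA", pwm)
    print(f"best window starts at {start} with log-odds score {score:.2f}")
```

In a feature-integrating model of the kind described above, a score like this would be one input alongside conservation, regulatory potential, CpG island and DNaseI hypersensitivity values, rather than the sole basis for a prediction.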
