DBD: a transcription factor prediction database

Regulation of gene expression influences almost all biological processes in an organism; sequence-specific DNA-binding transcription factors are critical to this control. For most genomes, the repertoire of transcription factors is only partially known. Hitherto transcription factor identification has been largely based on genome annotation pipelines that use pairwise sequence comparisons, which detect only those factors similar to known genes, or on functional classification schemes that amalgamate many types of proteins into the category of ‘transcription factor’. Using a novel transcription factor identification method, the DBD transcription factor database fills this void, providing genome-wide transcription factor predictions for organisms from across the tree of life. The prediction method behind DBD identifies sequence-specific DNA-binding transcription factors through homology using profile hidden Markov models (HMMs) of domains. Thus, it is limited to factors that are homologus to those HMMs. The collection of HMMs is taken from two existing databases (Pfam and SUPERFAMILY), and is limited to models that exclusively detect transcription factors that specifically recognize DNA sequences. It does not include basal transcription factors or chromatin-associated proteins, for instance. Based on comparison with experimentally verified annotation, the prediction procedure is between 95% and 99% accurate. Between one quarter and one-half of our genome-wide predicted transcription factors represent previously uncharacterized proteins. The DBD () consists of predicted transcription factor repertoires for 150 completely sequenced genomes, their domain assignments and the hand curated list of DNA-binding domain HMMs. Users can browse, search or download the predictions by genome, domain family or sequence identifier, view families of transcription factors based on domain architecture and receive predictions for a protein sequence.

[1]  Kenta Nakai,et al.  BTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics , 2004, Nucleic Acids Res..

[2]  Eugene V Koonin,et al.  Extensive domain shuffling in transcription regulators of DNA viruses and implications for the origin of fungal APSES transcription factors , 2002, Genome Biology.

[3]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[4]  E. Wingender,et al.  MATCH: A tool for searching transcription factor binding sites in DNA sequences. , 2003, Nucleic acids research.

[5]  Madeline A. Crosby,et al.  FlyBase: genes and gene models , 2004, Nucleic Acids Res..

[6]  Daniel W. A. Buchan,et al.  Evolution of protein superfamilies and bacterial genome size. , 2004, Journal of molecular biology.

[7]  David N. Messina,et al.  An ORFeome-based analysis of human transcription factor genes and the construction of a microarray to interrogate their expression. , 2004, Genome research.

[8]  Wyeth W. Wasserman,et al.  JASPAR: an open-access database for eukaryotic transcription factor binding profiles , 2004, Nucleic Acids Res..

[9]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[10]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[11]  Terry Gaasterland,et al.  Systematic characterization of the zinc-finger-containing proteins in the mouse transcriptome. , 2003, Genome research.

[12]  Nicola J. Rinaldi,et al.  Transcriptional regulatory code of a eukaryotic genome , 2004, Nature.

[13]  Nicola J. Rinaldi,et al.  Transcriptional Regulatory Networks in Saccharomyces cerevisiae , 2002, Science.

[14]  E. Nimwegen Scaling Laws in the Functional Content of Genomes , 2003, physics/0307001.

[15]  Judith A. Blake,et al.  MGD: the Mouse Genome Database , 2003, Nucleic Acids Res..

[16]  R. R. Samaha,et al.  Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. , 2000, Science.

[17]  Julio Collado-Vides,et al.  RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12 , 2004, Nucleic Acids Res..

[18]  Xin Chen,et al.  TRANSFAC: an integrated system for gene expression regulation , 2000, Nucleic Acids Res..

[19]  Francesc X. Avilés,et al.  TrSDB: a proteome database of transcription factors , 2004, Nucleic Acids Res..

[20]  M. Gerstein,et al.  Genomic analysis of regulatory network dynamics reveals large topological changes , 2004, Nature.

[21]  M. Gerstein,et al.  Structure and evolution of transcriptional regulatory networks. , 2004, Current opinion in structural biology.

[22]  Cyrus Chothia,et al.  The SUPERFAMILY database in 2004: additions and improvements , 2004, Nucleic Acids Res..

[23]  S. Teichmann,et al.  Evolution of transcription factors and the gene regulatory network in Escherichia coli. , 2003, Nucleic acids research.

[24]  Alexander E. Kel,et al.  MATCHTM: a tool for searching transcription factor binding sites in DNA sequences , 2003, Nucleic Acids Res..

[25]  J. Kawai,et al.  A genome-wide and nonredundant mouse transcription factor database. , 2004, Biochemical and biophysical research communications.