GPMiner: an integrated system for mining combinatorial cis-regulatory elements in mammalian gene group

Background
Sequence features in promoter regions are involved in regulating the initiation of gene transcription. Although numerous computational methods have been developed for predicting transcriptional start sites (TSSs) or transcription factor (TF) binding sites (TFBSs), they do not annotate several important regulatory features, such as CpG islands, tandem repeats, the TATA box, CCAAT box, GC box, over-represented oligonucleotides, DNA stability, and GC content. Additionally, combinatorial interactions among TFs regulate groups of genes that share the same expression pattern. To investigate gene transcriptional regulation, an integrated system is needed that annotates regulatory features in a promoter sequence and detects co-regulation by TFs in a group of genes.

Results
This work presents a novel system that identifies TSSs and regulatory features in a promoter sequence and recognizes the co-occurrence of cis-regulatory elements in co-expressed genes. Three well-known TSS prediction tools are combined with orthologous conserved features, such as CpG islands, nucleotide composition, over-represented hexamer nucleotides, and DNA stability, to construct the Gene Promoter Miner (GPMiner) using a support vector machine (SVM). According to five-fold cross-validation results, the predictive sensitivity and specificity are both roughly 80%. The system allows users to input a group of gene names or symbols, so that the co-occurrence of TFBSs in that group can be determined. Additionally, an input sequence can be analyzed for homogeneity with experimental mammalian promoter sequences, and conserved regulatory features between homologous promoters can be observed through cross-species analysis. After promoter regions are identified, regulatory features are visualized graphically to facilitate inspection of gene promoters.

Conclusions
GPMiner, which has a user-friendly input/output interface, offers numerous benefits for analyzing human and mouse promoters.
The proposed system is freely available at http://GPMiner.mbc.nctu.edu.tw/.
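
As an illustration of what determining TFBS co-occurrence in a gene group can mean, the following minimal sketch counts how often pairs of TFs have predicted binding sites in the same promoters. It is not GPMiner's implementation: the function name `tf_pair_support`, the `min_support` threshold, and the gene-to-TF assignments are hypothetical and chosen only for the example.

```python
from collections import Counter
from itertools import combinations


def tf_pair_support(gene_to_tfs, min_support=0.5):
    """Return TF pairs whose predicted sites co-occur in at least
    `min_support` (a fraction) of the genes in the group.

    `gene_to_tfs` maps a gene symbol to the set of TFs with a predicted
    binding site in that gene's promoter (hypothetical input format).
    """
    n_genes = len(gene_to_tfs)
    if n_genes == 0:
        return {}

    pair_counts = Counter()
    for tfs in gene_to_tfs.values():
        # Sort so that each unordered pair has one canonical key.
        for pair in combinations(sorted(set(tfs)), 2):
            pair_counts[pair] += 1

    return {
        pair: count / n_genes
        for pair, count in pair_counts.items()
        if count / n_genes >= min_support
    }


# Made-up gene group and TF assignments, for illustration only.
example_group = {
    "GENE_A": {"SP1", "NFKB1", "TBP"},
    "GENE_B": {"SP1", "NFKB1"},
    "GENE_C": {"SP1", "TBP"},
    "GENE_D": {"NFKB1", "SP1", "E2F1"},
}

supports = tf_pair_support(example_group, min_support=0.5)
for pair, support in sorted(supports.items()):
    print(pair, support)
```

In this toy group, the pair (NFKB1, SP1) is found in three of the four promoters and (SP1, TBP) in two, so both pass the 0.5 threshold. A real analysis would also have to account for how often each TF's sites occur by chance, which this sketch ignores.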
