Boosting with stumps for predicting transcription start sites

Promoter prediction is a difficult but important problem in gene finding, and it is critical for elucidating the regulation of gene expression. We introduce a new promoter prediction program, CoreBoost, which applies boosting with stumps to select important small-scale as well as large-scale features. CoreBoost greatly improves the localization of transcription start sites. We also demonstrate that additionally using tissue-specific information yields better accuracy.
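To make the idea of boosting with stumps concrete, the following is a minimal sketch rather than the CoreBoost implementation. It assumes discrete AdaBoost as the boosting variant (the abstract does not say which variant CoreBoost uses), and the three toy features (a motif score, a CpG score, and a noise column) are hypothetical. Each stump thresholds a single feature, so the features chosen across rounds act as the selected features.

```python
import math

def stump_predict(row, j, thr, polarity):
    # A stump votes `polarity` when feature j exceeds thr, otherwise the opposite.
    return polarity if row[j] > thr else -polarity

def fit_stump(X, y, w):
    # Exhaustively search features, thresholds, and polarities for the
    # lowest weighted error. Returns (error, feature, threshold, polarity).
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, thr, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost(X, y, rounds=10):
    # Discrete AdaBoost with labels in {+1, -1} (assumed variant, see lead-in).
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, pol))
        # Up-weight misclassified examples, down-weight correct ones.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, row):
    # Sign of the alpha-weighted vote of all stumps.
    score = sum(a * stump_predict(row, j, thr, pol) for a, j, thr, pol in model)
    return 1 if score >= 0 else -1

# Hypothetical toy data: columns are [motif score, CpG score, noise];
# +1 marks a transcription start site, -1 a background position.
X = [[0.9, 0.7, 0.2], [0.8, 0.3, 0.9], [0.7, 0.6, 0.5],
     [0.2, 0.6, 0.4], [0.3, 0.2, 0.8], [0.1, 0.7, 0.1]]
y = [1, 1, 1, -1, -1, -1]

model = adaboost(X, y, rounds=5)
selected = sorted({j for _, j, _, _ in model})  # indices of features the stumps used
```

On this toy data only the first column separates the classes, so `selected` contains just that feature. With real sequence features the same bookkeeping shows which features the ensemble relies on.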
