The influence of the background model on DNA motif prediction: An assessment for zinc finger transcription factor ZFX

Motif finding is a computationally expensive procedure that is prone to noise and false positives, yet it is of major importance for understanding gene expression and cancer. Several authors have argued that higher-order background models discriminate motifs better. This paper studies the effect of higher-order Markov background models on three commonly used motif-finding algorithms when they are applied to identify binding sites of the ZFX transcription factor in a mouse embryonic stem cell dataset. We conclude that, for each algorithm, particular Markov orders yield improved outcomes.
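To make the notion of a higher-order background model concrete, the following is a minimal sketch of how an order-k Markov background could be estimated from a set of sequences and then used to score a candidate site. It is an illustration only and not the procedure used by the algorithms assessed in the paper; the function names `train_background` and `background_log_prob`, the pseudocount, and the uniform fallback for unseen contexts are all assumptions made for this example.

```python
from collections import defaultdict
from math import log

ALPHABET = "ACGT"

def train_background(sequences, order, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from DNA sequences.

    Hypothetical helper for illustration; order=0 reduces to plain
    base frequencies, stored under the empty context "".
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows that contain ambiguous symbols such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1
    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + pseudocount * len(ALPHABET)
        model[context] = {
            b: (n + pseudocount) / total for b, n in base_counts.items()
        }
    return model

def background_log_prob(seq, model, order):
    """Log-probability of seq[order:] under the background model.

    Assumption: contexts or symbols never seen in training fall back
    to a uniform probability of 1/4.
    """
    seq = seq.upper()
    uniform = 1.0 / len(ALPHABET)
    total = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        total += log(model.get(context, {}).get(base, uniform))
    return total

# Usage: fit a first-order background on toy sequences and score a site.
toy_sequences = ["ACGTACGT", "AGGCCTAG"]
bg = train_background(toy_sequences, order=1)
score = background_log_prob("AGGCCT", bg, order=1)
```

A candidate site whose log-probability under the background is low is harder to explain as chance, which is why the choice of order can change which motifs a finder reports. Higher orders capture more sequence context but need more training data per context, so the best order may depend on the dataset and the algorithm.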
