Simultaneously Learning DNA Motif along with Its Position and Sequence Rank Preferences through EM Algorithm

Although de novo motifs can be discovered through mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. To improve accuracy, one solution is to consider some additional binding features (i.e. position preference and sequence rank preference). This information is usually required from the user. This paper presents a de novo motif discovery algorithm called SEME which uses pure probabilistic mixture model to model the motif's binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif, position and sequence rank preferences without asking for any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: the variable motif length extension and importance sampling. Using 75 large scale synthetic datasets, 32 metazoan compendium benchmark datasets and 164 ChIP-Seq libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to a more difficult problem of finding the co-regulated TF (co-TF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct co-TF motifs and, at the same time, predicted co-TF motifs with better matching to the known motifs. Finally, we show that the learned position and sequence rank preferences of each co-TF reveals potential interaction mechanisms between the primary TF and the co-TF within these sites. Some of these findings were further validated by the ChIP-Seq experiments of the co-TFs.

[1]  Timothy L. Bailey,et al.  Gene expression Advance Access publication May 4, 2011 DREME: motif discovery in transcription factor ChIP-seq data , 2011 .

[2]  E. Cheung,et al.  Genomic analyses of hormone signaling and gene regulation. , 2010, Annual review of physiology.

[3]  A. Sandelin,et al.  Applied bioinformatics for the identification of regulatory elements , 2004, Nature Reviews Genetics.

[4]  Graziano Pesole,et al.  An algorithm for finding signals of unknown length in DNA sequences , 2001, ISMB.

[5]  George Varghese,et al.  A uniform projection method for motif discovery in DNA sequences , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[6]  Ankush Mittal,et al.  Localized motif discovery in gene regulatory sequences , 2010, Bioinform..

[7]  Vsevolod J. Makeev,et al.  Deep and wide digging for binding motifs in ChIP-Seq data , 2010, Bioinform..

[8]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[9]  John E. Reid,et al.  STEME: efficient EM to find motifs in large data sets , 2011, Nucleic acids research.

[10]  R. Shamir,et al.  Transcription factor and microRNA motif discovery: the Amadeus platform and a compendium of metazoan target sets. , 2008, Genome research.

[11]  E. Birney,et al.  Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation , 2007, Nature Methods.

[12]  Xin Chen,et al.  TRANSFAC: an integrated system for gene expression regulation , 2000, Nucleic Acids Res..

[13]  Wing-Kin Sung,et al.  CENTDIST: discovery of co-associated factors by motif distribution , 2011, Nucleic Acids Res..

[14]  Jens Keilwagen,et al.  De-Novo Discovery of Differentially Abundant Transcription Factor Binding Sites Including Their Positional Preference , 2011, PLoS Comput. Biol..

[15]  Saurabh Sinha,et al.  On counting position weight matrix matches in a sequence, with application to discriminative motif finding , 2006, ISMB.

[16]  Siu-Ming Yiu,et al.  A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays , 2002, COCOON.

[17]  S. Batzoglou,et al.  Genome-Wide Analysis of Transcription Factor Binding Sites Based on ChIP-Seq Data , 2008, Nature Methods.

[18]  A. Sharov,et al.  Exhaustive Search for Over-represented DNA Sequence Motifs with CisFinder , 2009, DNA research : an international journal for rapid publication of reports on genes and genomes.

[19]  M. Berger,et al.  Protein binding microarrays (PBMs) for rapid, high-throughput characterization of the sequence specificities of DNA binding proteins. , 2006, Methods in molecular biology.

[20]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[21]  Saurabh Sinha,et al.  A Statistical Method for Finding Transcription Factor Binding Sites , 2000, ISMB.

[22]  Wing-Kin Sung,et al.  Cellular reprogramming by the conjoint action of ERα, FOXA1, and GATA3 to a ligand-inducible growth state , 2011, Molecular systems biology.

[23]  Renjie Jin,et al.  The role of hepatocyte nuclear factor-3 alpha (Forkhead Box A1) and androgen receptor in transcriptional regulation of prostatic genes. , 2003, Molecular endocrinology.

[24]  E. Lehmann Elements of large-sample theory , 1998 .

[25]  Panayiotis V. Benos,et al.  DNA Familial Binding Profiles Made Easy: Comparison of Various Motif Alignment and Clustering Strategies , 2007, PLoS Comput. Biol..

[26]  Siu-Ming Yiu,et al.  A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays , 2002, Algorithmica.

[27]  Nicola J. Rinaldi,et al.  Transcriptional regulatory code of a eukaryotic genome , 2004, Nature.

[28]  Qiang Wu,et al.  Mark the transition: chromatin modifications and cell fate decision , 2011, Cell Research.

[29]  Yijun Ruan,et al.  Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies. , 2007, Genome research.

[30]  O. Kallioniemi,et al.  Dual role of FoxA1 in androgen receptor binding to chromatin, androgen signalling and prostate cancer , 2011, The EMBO journal.

[31]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[32]  Donald L. Iglehart,et al.  Importance sampling for stochastic simulations , 1989 .

[33]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[34]  Jun S. Liu,et al.  An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments , 2002, Nature Biotechnology.

[35]  Xiaoyu Chen,et al.  RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors , 2007, ISMB/ECCB.

[36]  Z. Weng,et al.  Finding functional sequence elements by multiple local alignment. , 2004, Nucleic acids research.

[37]  Yongchao Liu,et al.  CUDA-MEME: Accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units , 2010, Pattern Recognit. Lett..

[38]  Zhaohui S. Qin,et al.  On the detection and refinement of transcription factor binding sites using ChIP-Seq data , 2010, Nucleic acids research.

[39]  O. Kallioniemi,et al.  Dual role of FoxA 1 in androgen receptor binding to chromatin , androgen signalling and prostate cancer , 2011 .

[40]  N. D. Clarke,et al.  Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells , 2008, Cell.