Evaluation of methods for modeling transcription factor sequence specificity

Genomic analyses often involve scanning for potential transcription factor (TF) binding sites using models of the sequence specificity of DNA-binding proteins. Many approaches have been developed to model and learn a protein's DNA-binding specificity, but these methods have not been systematically compared. Here we applied 26 such approaches to in vitro protein binding microarray data for 66 mouse TFs belonging to various families. For nine TFs, we also scored the resulting motif models on in vivo data, and found that the best in vitro–derived motifs performed similarly to motifs derived from the in vivo data. Our results indicate that simple models based on mononucleotide position weight matrices trained by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases (<10% of the TFs examined here). In addition, the best-performing motifs typically have relatively low information content, consistent with widespread degeneracy in eukaryotic TF sequence preferences.
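
To make the terms in the abstract concrete, the following is a minimal sketch of a mononucleotide position weight matrix (PWM), its information content, and a sequence scan. It is an illustration only, not any of the 26 evaluated methods. The function names, the pseudocount of 1, the uniform background of 0.25 per base, and the example sites are all assumptions chosen for this sketch.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length aligned sites.

    The pseudocount is an assumed smoothing choice, not a value from the paper.
    """
    length = len(sites[0])
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def information_content(pwm):
    """Total information content in bits, assuming a uniform background."""
    return sum(
        2.0 + sum(p * math.log2(p) for p in col.values() if p > 0)
        for col in pwm
    )

def score(pwm, seq, background=0.25):
    """Log-odds score of one window; positions contribute independently."""
    return sum(math.log2(col[b] / background) for col, b in zip(pwm, seq))

def scan(pwm, sequence):
    """Score every window of the PWM's width along one strand of a sequence."""
    width = len(pwm)
    return [
        (i, score(pwm, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]

# Hypothetical aligned binding sites, used only to exercise the functions.
example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA"]
example_pwm = build_pwm(example_sites)
best_position, best_score = max(scan(example_pwm, "AAATGACTCAGGG"), key=lambda hit: hit[1])
```

Because each position contributes independently to the score, a model of this form cannot capture dependencies between positions, which is the kind of limitation the more complex models in the comparison are meant to address. A degenerate motif, where several bases are tolerated at a position, yields a lower value from `information_content` than a motif with one strongly preferred base per position.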
