Recognition models to predict DNA-binding specificities of homeodomain proteins

Motivation: Recognition models for protein-DNA interactions, which allow the specificity of a DNA-binding domain to be predicted from its sequence alone, or its specificity to be altered through rational design, have long been a goal of computational biology. There has been some progress in constructing useful models, especially for C2H2 zinc finger proteins, but the problem remains challenging, with ample room for improvement. For most families of transcription factors, the best available methods use k-nearest neighbor (KNN) algorithms, which predict the specificity of a protein as the average of the specificities of the k most similar proteins with defined specificities. Homeodomain (HD) proteins are the second most abundant family of transcription factors, after zinc fingers, in most metazoan genomes, so an effective recognition model for this family would facilitate predictive models of many transcriptional regulatory networks within these genomes.

Results: Using extensive experimental data, we have tested several machine learning approaches and find that both support vector machines and random forests (RFs) can produce recognition models for HD proteins that are significant improvements over KNN-based methods. Cross-validation analyses show that the resulting models are capable of predicting specificities with high accuracy. We have produced a web-based prediction tool, PreMoTF (Predicted Motifs for Transcription Factors) (http://stormo.wustl.edu/PreMoTF), for predicting position frequency matrices from protein sequence using an RF-based model.

Contact: stormo@wustl.edu
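The KNN baseline described under Motivation can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that proteins are represented by pre-aligned, equal-length residue strings, that similarity is plain fractional identity, and that each known specificity is a position frequency matrix (PFM) with columns ordered A, C, G, T. The function names and the toy sequences and matrices are hypothetical.

```python
from typing import Dict, List

# A PFM: one row per motif position, columns ordered A, C, G, T (assumed layout).
Pfm = List[List[float]]


def identity(a: str, b: str) -> float:
    """Fraction of identical residues between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def knn_predict(query: str, training: Dict[str, Pfm], k: int = 3) -> Pfm:
    """Average the PFMs of the k training proteins most similar to the query."""
    # Rank training sequences by similarity to the query, most similar first.
    ranked = sorted(training, key=lambda seq: identity(query, seq), reverse=True)
    neighbors = [training[seq] for seq in ranked[:k]]
    n_pos = len(neighbors[0])
    # Element-wise mean over the selected neighbors' matrices.
    return [
        [sum(pfm[i][b] for pfm in neighbors) / len(neighbors) for b in range(4)]
        for i in range(n_pos)
    ]


# Toy, made-up data: three "proteins" with two-position PFMs.
toy_training: Dict[str, Pfm] = {
    "RKRA": [[1, 0, 0, 0], [0, 1, 0, 0]],
    "RKRG": [[0, 0, 0, 1], [0, 1, 0, 0]],
    "QQQQ": [[0, 0, 1, 0], [0, 0, 1, 0]],
}
predicted = knn_predict("RKRT", toy_training, k=2)
```

In this toy case the two nearest neighbors agree at the second motif position and disagree at the first, so the averaged prediction is certain at one position and split at the other. A learned model such as an RF could instead weight individual residues by how much they influence each base position.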
