Prediction of transcription factor binding to DNA using rule induction methods

In this study, we seek to develop a model that predicts the strength of binding between a particular transcription factor (TF) variant and a particular DNA target variant. Our main focus is the DNA-binding paired domains of the Pax transcription factors, which show seemingly fuzzy and degenerate binding to various DNA targets; paired domain-DNA binding is therefore not well suited to previously proposed algorithms. Here, we introduce a simple way to use rule induction for predicting the strength of TF-DNA binding. We created a dataset of 597 example cases of paired domain-DNA binding by collecting information about all published and quantified interactions between TF and DNA sequence variants. Applying the rule induction based method to this dataset yields an accuracy of 69.7% (estimated by cross-validation), which is high although far from ideal. Perhaps more importantly, the method found several useful rules for predicting binding strength. Although the primary motivation for introducing rule induction based methods is the lack of efficient algorithms for paired domain-DNA binding prediction, we also show that the method can be applied with some success to a better-studied TF-DNA binding prediction task involving the early growth response (EGR) TF family.

Summary

The transcription of DNA into mRNA is initiated and aided by a number of transcription factors (TFs), proteins with DNA-binding regions that attach to binding sites in the DNA (transcription factor binding sites, TFBSs). As it has become apparent that both TFs and TFBSs are highly variable, tools are needed to quantify the strength of the interaction that results when a certain TF variant binds to a certain TFBS. Ideally, one would like a method in which any combination of TF amino acids is allowed to interact with any combination of TFBS nucleotides. Rule induction algorithms might be such a method.

We used a simple way to predict interactions between protein and DNA. Given experimental cases from the literature in which the interaction strength between two sequences has been quantified, we created training vectors for rule induction by treating each amino acid position and each nucleotide position as a single feature in the example vector. The measured interaction strength was used as the target class or value. These training vectors were then used to build a rule induction model. We applied the rule induction method to two protein families, transcription factors from the Pax and the early growth response (EGR) families, and their corresponding DNA targets.
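The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequences are taken to be already aligned and of equal length, the learner is the very simple 1R rule inducer (used here as a stand-in, since the text above does not specify the algorithm), and the function names (`make_example`, `induce_one_rule`, `predict`) as well as the toy sequences and labels are hypothetical and are not data from the study.

```python
from collections import Counter, defaultdict


def make_example(protein, dna):
    """Encode one case: one feature per amino acid and per nucleotide position."""
    features = {f"aa{i + 1}": aa for i, aa in enumerate(protein)}
    features.update({f"nt{j + 1}": nt for j, nt in enumerate(dna)})
    return features


def majority(labels):
    """Most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def induce_one_rule(examples, labels):
    """1R: keep the single feature whose value -> class rules misclassify least."""
    default = majority(labels)
    best = None
    for feature in sorted(examples[0]):
        by_value = defaultdict(list)
        for ex, label in zip(examples, labels):
            by_value[ex[feature]].append(label)
        rules = {value: majority(ls) for value, ls in by_value.items()}
        errors = sum(rules[ex[feature]] != label
                     for ex, label in zip(examples, labels))
        if best is None or errors < best[0]:
            best = (errors, feature, rules)
    _, feature, rules = best
    return feature, rules, default


def predict(model, example):
    """Apply the induced rule; fall back to the default class for unseen values."""
    feature, rules, default = model
    return rules.get(example[feature], default)


# Hypothetical toy cases: (protein variant, DNA variant, binding class).
toy_cases = [
    ("RKS", "ACGT", "strong"),
    ("RKS", "ACTT", "weak"),
    ("RNS", "ACGT", "strong"),
    ("RNS", "ACTT", "weak"),
    ("QKS", "ACGT", "strong"),
    ("QKS", "ACTT", "weak"),
]
examples = [make_example(p, d) for p, d, _ in toy_cases]
labels = [label for _, _, label in toy_cases]

model = induce_one_rule(examples, labels)
feature, rules, _ = model
for value, label in sorted(rules.items()):
    print(f"IF {feature} = {value} THEN binding = {label}")
print(predict(model, make_example("QNS", "ACTT")))
```

A practical rule inducer would combine several positions in one rule and would be evaluated by cross-validation rather than on its own training data; 1R is used here only because it keeps the feature encoding easy to follow.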
