Computational modeling of in vivo and in vitro protein‐DNA interactions by multiple instance learning

Motivation: The study of transcriptional regulation remains difficult yet fundamental in molecular biology research. While the development of both in vivo and in vitro profiling techniques has significantly enhanced our knowledge of transcription factor (TF)‐DNA interactions, computational models of TF‐DNA interactions are relatively simple and may not reveal sufficient biological insight. In particular, supervised learning based models for TF‐DNA interactions attempt to map sequence‐level features (k‐mers) to binding events but usually ignore the location of the k‐mers, which can cause data fragmentation and consequently inferior model performance.

Results: Here, we propose a novel algorithm based on the multiple‐instance learning (MIL) paradigm. MIL breaks each DNA sequence into multiple overlapping subsequences and models each subsequence separately, thereby implicitly taking binding site locations into account, which results in both higher accuracy and better interpretability of the models. Results on both in vivo and in vitro TF‐DNA interaction data show that our approach significantly outperforms conventional single‐instance learning based algorithms. Importantly, the models learned from in vitro data using our approach can predict in vivo binding with very good accuracy. In addition, the location information obtained by our method provides additional insight for motif finding results from ChIP‐Seq data. Finally, our approach can be easily combined with other state‐of‐the‐art TF‐DNA interaction modeling methods.

Availability and Implementation: http://www.cs.utsa.edu/~jruan/MIL/

Contact: jianhua.ruan@utsa.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
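The MIL idea described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window width, the k‐mer weights and the use of a maximum over instance scores (a common MIL assumption) are all hypothetical choices made for illustration. Each sequence is treated as a bag, its overlapping subsequences as instances, and the bag is scored by its best instance, whose start position gives a putative binding location.

```python
def make_bag(seq, width, stride=1):
    """Split one DNA sequence (the bag) into overlapping subsequences (instances).

    Returns a list of (start, subsequence) pairs. If the sequence is shorter
    than the window, the whole sequence is the only instance.
    """
    if width >= len(seq):
        return [(0, seq)]
    return [(i, seq[i:i + width]) for i in range(0, len(seq) - width + 1, stride)]


def kmer_counts(instance, k):
    """Count every k-mer occurring in a single instance."""
    counts = {}
    for i in range(len(instance) - k + 1):
        kmer = instance[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def score_instance(instance, weights, k):
    """Linear score of an instance: sum of (hypothetical) k-mer weights times counts."""
    return sum(weights.get(kmer, 0.0) * c for kmer, c in kmer_counts(instance, k).items())


def score_bag(seq, weights, k, width, stride=1):
    """Score a bag by its highest-scoring instance (assumed max aggregation).

    Returns (best_score, best_start); best_start is the inferred location.
    """
    scored = [(score_instance(inst, weights, k), start)
              for start, inst in make_bag(seq, width, stride)]
    return max(scored)


if __name__ == "__main__":
    # Toy weights favouring the 3-mers of the made-up site "TGACTCA".
    toy_weights = {"TGA": 1.0, "GAC": 1.0, "ACT": 1.0, "CTC": 1.0, "TCA": 1.0}
    print(score_bag("AAAATGACTCAAAAA", toy_weights, k=3, width=7))
```

Because features are computed per window rather than over the whole sequence, k‐mers that belong to the same site are scored together, and the position of the top window is available as a by-product. In a real model the weights would be learned from labelled bags rather than set by hand.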
