Integrating sequence and gene expression information predicts genome-wide DNA-binding proteins and suggests a cooperative mechanism

Abstract DNA-binding proteins (DBPs) perform diverse biological functions ranging from transcription to pathogen sensing. Machine learning methods can not only identify DBPs de novo but also provide insights into their DNA-recognition dynamics. However, it remains unclear whether available methods that accurately predict DNA-binding sites in known DBPs can also identify novel DBPs. Moreover, sequence information is blind to the cell- and disease-specific contexts of DBP activity, whereas public gene expression data, which remain under-utilized, offer great promise. To address these issues, we have developed novel methods for predicting DBPs by integrating sequence-derived and gene expression-derived features, and applied them to explore the human, mouse and Arabidopsis proteomes. While our sequence-based models outperformed the gene expression-based ones, some proteins with weaker DBP-like sequence features were correctly predicted by gene expression-based features, suggesting that these proteins acquire a tangible DBP functionality in a conducive gene expression environment. Analysis of motif enrichment among the co-expressed genes of the top 100 candidate DBPs from hitherto unannotated genes provides further avenues to explore their functional associations.
