DHSpred: support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest

DNase I hypersensitive sites (DHSs) are genomic regions that provide important information about the presence of transcriptional regulatory elements and the state of chromatin. Identifying DHSs in uncharacterized DNA sequences is therefore crucial for understanding their biological functions and mechanisms. Although many experimental methods have been proposed to identify DHSs, they have proven expensive for genome-wide application, so computational methods for DHS prediction are needed. In this study, we propose a support vector machine (SVM)-based method for predicting DHSs, called DHSpred (DNase I Hypersensitive Site predictor in human DNA sequences), which was trained with 174 optimal features. The optimal combination of features was identified with a random forest algorithm from a large set that included nucleotide composition and di- and trinucleotide physicochemical properties. DHSpred achieved a Matthews correlation coefficient of 0.660 and an accuracy of 0.871, both 3% higher than those of control SVM predictors trained with non-optimized features, indicating the effectiveness of the feature selection method. Furthermore, DHSpred outperformed state-of-the-art predictors. An online prediction server has been developed to assist the scientific community and is freely available at: http://www.thegleelab.org/DHSpred.html
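As a rough illustration of two ingredients named in the abstract, the sketch below computes k-mer nucleotide composition features from a DNA sequence and evaluates a binary classifier with accuracy and the Matthews correlation coefficient. It is a minimal sketch under stated assumptions, not the DHSpred implementation: the function names `kmer_composition`, `accuracy`, and `mcc` are hypothetical, the example sequence and confusion-matrix counts are invented for illustration, and the physicochemical-property features, the random forest feature selection, and the SVM training are not reproduced here.

```python
from itertools import product
from math import sqrt


def kmer_composition(seq, k):
    """Return normalized k-mer frequencies in lexicographic A, C, G, T order.

    Hypothetical helper: windows containing symbols other than A, C, G, T
    (for example N) are skipped rather than counted.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions from confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative sequence (assumption, not from the paper's dataset).
example = "ACGTTGCAACGT"
# Concatenate mono-, di-, and trinucleotide composition: 4 + 16 + 64 values.
features = (
    kmer_composition(example, 1)
    + kmer_composition(example, 2)
    + kmer_composition(example, 3)
)
print(len(features))

# Illustrative confusion-matrix counts (assumption, not the paper's results).
print(accuracy(tp=40, tn=45, fp=5, fn=10), mcc(tp=40, tn=45, fp=5, fn=10))
```

In a pipeline of the kind the abstract describes, vectors like `features` would be ranked by a random forest, the top-ranked subset passed to an SVM, and the resulting predictions scored with `accuracy` and `mcc`.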
