pDHS-SVM: A prediction method for plant DNase I hypersensitive sites based on support vector machine.

DNase I hypersensitive sites (DHSs) are accessible chromatin regions that are hypersensitive to cleavage by the DNase I endonuclease. DHSs are indicative of cis-regulatory DNA elements (CREs), which play important roles in the global regulation of gene expression, so recognizing DHSs in a genome helps to discover CREs. Cost-effective computational methods for identifying DHSs are an important complement to experimental approaches and can accelerate this investigation. However, tools for identifying DHSs in plant genomes are lacking. Here we present pDHS-SVM, a computational predictor of plant DHSs. To integrate global sequence-order information with local DNA properties, the feature space was constructed from the reverse-complement k-mer composition and the dinucleotide-based auto covariance of DNA sequences. Fifteen physical-chemical properties of dinucleotides were used, and a Support Vector Machine (SVM) was employed as the classifier. To further improve performance and to extract an optimized subset of nucleotide physical-chemical properties that contribute positively to DHS prediction, a heuristic nucleotide physical-chemical property selection algorithm was introduced. With the optimized subset of properties, experiments on Arabidopsis thaliana and rice (Oryza sativa) showed that pDHS-SVM achieved accuracies of up to 87.00% and 85.79%, respectively. These results indicate the effectiveness of the proposed method for predicting DHSs. Furthermore, pDHS-SVM could provide a helpful complement for predicting CREs in plant genomes. Our implementation of pDHS-SVM is freely available as source code at https://github.com/shanxinzhang/pDHS-SVM.
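The two feature types named above can be illustrated with a minimal Python sketch. It is not taken from the pDHS-SVM source code: the function names, the default parameters, and the toy property table (the number of G/C bases in each dinucleotide, standing in for one of the fifteen physical-chemical properties) are assumptions made for illustration only. The auto covariance follows the commonly used definition, in which the property values of dinucleotides separated by a given lag are centered on the sequence mean, multiplied, and averaged.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def revc_kmer_features(seq, k=2):
    """Normalized k-mer frequencies with each k-mer merged with its
    reverse complement (the lexicographically smaller one is kept)."""
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    canonical = sorted({min(m, reverse_complement(m)) for m in all_kmers})
    counts = dict.fromkeys(canonical, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
            counts[min(kmer, reverse_complement(kmer))] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in canonical]


def dac_features(seq, prop, max_lag=2):
    """Dinucleotide-based auto covariance of one property for lags
    1..max_lag. Requires len(seq) > max_lag + 1."""
    values = [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]
    mean = sum(values) / len(values)
    feats = []
    for lag in range(1, max_lag + 1):
        n = len(values) - lag
        cov = sum((values[i] - mean) * (values[i + lag] - mean)
                  for i in range(n)) / n
        feats.append(cov)
    return feats


# Hypothetical stand-in property: number of G/C bases in the dinucleotide.
# Real physical-chemical property tables would replace this dictionary.
TOY_PROPERTY = {a + b: float((a in "GC") + (b in "GC"))
                for a, b in product("ACGT", repeat=2)}

if __name__ == "__main__":
    sequence = "ACGTTGCAAGGCTA"
    # Concatenate both feature groups into one vector for a classifier.
    vector = revc_kmer_features(sequence, k=2) + dac_features(sequence, TOY_PROPERTY)
    print(len(vector), vector)
```

For k = 2 the reverse-complement merging reduces the 16 dinucleotides to 10 features. With several properties, the auto covariance values of each property would be appended in turn, and the concatenated vector could then be passed to any SVM implementation.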
