Identify origin of replication in Saccharomyces cerevisiae using two-step feature selection technique

MOTIVATION: DNA replication is a key step in maintaining the continuity of genetic information between the parental generation and offspring. The initiation site of DNA replication, also called the origin of replication (ORI), plays an essential role in this process, so rapidly and accurately locating ORIs in a genome provides key clues for genome analysis. Although biochemical experiments can provide detailed information about ORIs, they are costly and time-consuming. Computational methods can complement experimental techniques by overcoming these disadvantages.

RESULTS: In this study, we developed a predictor called iORI-PseKNC2.0 to identify ORIs in the Saccharomyces cerevisiae genome based on sequence information. Pseudo k-tuple nucleotide composition (PseKNC) incorporating 90 physicochemical properties was used to represent ORI and non-ORI samples. To improve accuracy, a two-step feature selection procedure was applied to exclude redundant and noisy information. As a result, an overall success rate of 88.53% was achieved in 5-fold cross-validation using a support vector machine.

AVAILABILITY AND IMPLEMENTATION: Based on the proposed model, a user-friendly webserver was established and can be freely accessed at http://lin-group.cn/server/iORI-PseKNC2.0. The webserver is intended to make the predictor convenient for wet-lab researchers.
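As an illustration of the sequence encoding and validation scheme mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It computes only the plain k-tuple nucleotide composition of a sequence; the physicochemical correlation terms that PseKNC adds, the two-step feature selection, and the support vector machine itself are omitted. The function names `ktuple_composition` and `five_fold_indices` are hypothetical, and the interleaved fold assignment is an assumption made for simplicity.

```python
from itertools import product


def ktuple_composition(seq, k=2):
    """Return normalized frequencies of all 4**k k-tuples in a DNA sequence.

    Features are ordered lexicographically over the alphabet ACGT
    (for k=2: AA, AC, AG, AT, CA, ...).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def five_fold_indices(n_samples, n_folds=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Assumption: sample i is assigned to fold i % n_folds; a real study
    would usually shuffle and stratify by class first.
    """
    for f in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == f]
        train = [i for i in range(n_samples) if i % n_folds != f]
        yield train, test


features = ktuple_composition("ACGTACGT", k=2)
folds = list(five_fold_indices(10))
```

In such a setup, each sequence would be encoded as a feature vector, a classifier would be trained on the `train` indices of each fold, and accuracy would be averaged over the five held-out `test` sets.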
