Using extreme gradient boosting to identify origin of replication in Saccharomyces cerevisiae via hybrid features.

DNA replication is a fundamental process that plays a crucial role in the propagation of all living organisms. Accurate identification of replication origins could therefore be key to understanding the regulatory mechanisms of gene expression. With the rapid development of computational techniques and the abundance of biological sequencing data, it has become possible to identify origins of replication accurately and quickly. This problem has drawn considerable attention from experts in the field, but further work is required to improve predictive performance. This study therefore explores the combination of state-of-the-art features with an extreme gradient boosting learning system for classifying DNA sequences. Our hybrid approach identifies origins of DNA replication with a sensitivity of 85.19%, specificity of 93.83%, accuracy of 89.51%, and Matthews correlation coefficient (MCC) of 0.7931. We present evidence that the proposed method outperforms state-of-the-art methods on the same benchmark dataset. Moreover, these results represent a further step towards developing prediction models for DNA replication in particular and DNA sequences in general.
