Sc-ncDNAPred: A Sequence-Based Predictor for Identifying Non-coding DNA in Saccharomyces cerevisiae

With the rapid development of high-speed sequencing technologies and the implementation of many whole-genome sequencing projects, genomics research is advancing from genome sequencing to genome synthesis. Cutting-edge synthetic biology technologies, such as DNA-based molecular assembly, genome editing, directed evolution, and DNA storage, are emerging in succession. In particular, the rapid growth of DNA assembly technology may greatly advance the creation of artificial life. DNA assembly, however, requires a large number of well-characterized target sequences as supporting data. Non-coding DNA (ncDNA) sequences make up most of organisms' genomes, so recognizing them accurately is necessary. Although experimental methods have been proposed to detect ncDNA sequences, they are too expensive for genome-wide detection. It is therefore necessary to develop machine-learning methods for predicting ncDNA sequences. In this study, we collected an ncDNA benchmark dataset for Saccharomyces cerevisiae and developed a support vector machine-based predictor, called Sc-ncDNAPred, for predicting ncDNA sequences. The optimal feature extraction strategy was selected from a group that included mononucleotide, dimer, trimer, tetramer, pentamer, and hexamer compositions, using the support vector machine learning method. Sc-ncDNAPred achieved an overall accuracy of 0.98. For the convenience of users, an online web server has been built at: http://server.malab.cn/Sc_ncDNAPred/index.jsp.
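The candidate feature sets named above (mononucleotide through hexamer) can be illustrated with a minimal k-mer composition sketch. This is not the authors' implementation: the function names `kmer_frequencies` and `feature_sets`, the lexicographic ordering over A, C, G, T, the normalization by window count, and the skipping of windows containing other characters are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; ambiguous bases such as N are skipped


def kmer_frequencies(sequence, k):
    """Return the normalized k-mer composition of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    Windows containing any character outside ALPHABET are ignored.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of length k along the sequence, one base at a time.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        # Sequence shorter than k, or no valid windows: return all zeros.
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def feature_sets(sequence, max_k=6):
    """Build one feature vector per k, from mononucleotide (k=1) to hexamer (k=6)."""
    return {k: kmer_frequencies(sequence, k) for k in range(1, max_k + 1)}


if __name__ == "__main__":
    features = feature_sets("ACGTACGTTTGACCA")
    for k, vector in features.items():
        print(k, len(vector))
```

Each vector could then be passed to a support vector machine, and the value of k giving the best cross-validated performance retained. That selection step is described only in outline in the abstract and is not reproduced here.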
