iRSpot-EL: identify recombination spots with an ensemble learning approach

Motivation: Coexisting in a DNA system, meiosis and recombination are two indispensible aspects for cell reproduction and growth. With the avalanche of genome sequences emerging in the post-genomic age, it is an urgent challenge to acquire the information of DNA recombination spots because it can timely provide very useful insights into the mechanism of meiotic recombination and the process of genome evolution. Results: To address such a challenge, we have developed a predictor, called iRSpot-EL, by fusing different modes of pseudo K-tuple nucleotide composition and mode of dinucleotide-based auto-cross covariance into an ensemble classifier of clustering approach. Five-fold cross tests on a widely used benchmark dataset have indicated that the new predictor remarkably outperforms its existing counterparts. Particularly, far beyond their reach, the new predictor can be easily used to conduct the genome-wide analysis and the results obtained are quite consistent with the experimental map. Availability and Implementation: For the convenience of most experimental scientists, a user-friendly web-server for iRSpot-EL has been established at http://bioinformatics.hitsz.edu.cn/iRSpot-EL/, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. Contact: bliu@gordonlifescience.org or bliu@insun.hit.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.

[1]  Junjie Chen,et al.  Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences , 2015, Nucleic Acids Res..

[2]  Xin Wang,et al.  PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. , 2012, Analytical biochemistry.

[3]  Vladimir Vapnik,et al.  An overview of statistical learning theory , 1999, IEEE Trans. Neural Networks.

[4]  Delbert Dueck,et al.  Clustering by Passing Messages Between Data Points , 2007, Science.

[5]  Johan A. K. Suykens,et al.  Least Squares Support Vector Machine Classifiers , 1999, Neural Processing Letters.

[6]  K. Chou,et al.  EzyPred: a top-down approach for predicting enzyme functional classes and subclasses. , 2007, Biochemical and biophysical research communications.

[7]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001 .

[8]  Dong-Sheng Cao,et al.  propy: a tool to generate various modes of Chou's PseAAC , 2013, Bioinform..

[9]  K. Chou,et al.  Prediction of linear B-cell epitopes using amino acid pair antigenicity scale , 2007, Amino Acids.

[10]  K. Chou,et al.  PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. , 2014, Analytical biochemistry.

[11]  Xiaolong Wang,et al.  repRNA: a web server for generating various feature vectors of RNA sequences , 2015, Molecular Genetics and Genomics.

[12]  K. Chou Impacts of bioinformatics to medicinal chemistry. , 2015, Medicinal chemistry (Shariqah (United Arab Emirates)).

[13]  Wei Chen,et al.  iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition , 2013, Nucleic acids research.

[14]  Jia Liu,et al.  Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. , 2012, Journal of theoretical biology.

[15]  Wei Chen,et al.  iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition , 2014, Bioinform..

[16]  Xiaolong Wang,et al.  repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects , 2015, Bioinform..

[17]  Wei Chen,et al.  iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition , 2014, Nucleic acids research.

[18]  Pufeng Du,et al.  PseAAC-General: Fast Building Various Modes of General Form of Chou’s Pseudo-Amino Acid Composition for Large-Scale Protein Datasets , 2014, International journal of molecular sciences.

[19]  K. Chou,et al.  Recent progress in protein subcellular location prediction. , 2007, Analytical biochemistry.

[20]  Kuo-Chen Chou,et al.  iPhos-PseEn: Identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier , 2016, Oncotarget.

[21]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[22]  Kuo-Chen Chou,et al.  Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes , 2005, Bioinform..

[23]  K. Chou,et al.  Prediction of protein structural classes. , 1995, Critical reviews in biochemistry and molecular biology.

[24]  Kuo-Chen Chou,et al.  iPTM-mLys: identifying multiple lysine PTM sites and their different types , 2016, Bioinform..

[25]  Ren Long,et al.  iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework , 2016, Bioinform..

[26]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[27]  Wei Chen,et al.  PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions , 2015, Bioinform..

[28]  Vivek Gopalan,et al.  High recombination rates and hotspots in a Plasmodium falciparum genetic cross , 2011, Genome Biology.

[29]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[30]  Wei Chen,et al.  iNuc-PhysChem: A Sequence-Based Predictor for Identifying Nucleosomes via Physicochemical Properties , 2012, PloS one.

[31]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[32]  K. Chou Some remarks on protein attribute prediction and pseudo amino acid composition , 2010, Journal of Theoretical Biology.

[33]  L. Steinmetz,et al.  High-resolution mapping of meiotic crossovers and non-crossovers in yeast , 2008, Nature.

[34]  Adam Godzik,et al.  Clustering of highly homologous sequences to reduce the size of large protein databases , 2001, Bioinform..

[35]  Kuo-Chen Chou,et al.  pSumo-CD: predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC , 2016, Bioinform..

[36]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[37]  Kuo-Chen Chou,et al.  MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. , 2007, Biochemical and biophysical research communications.

[38]  Mark Goadrich,et al.  The relationship between Precision-Recall and ROC curves , 2006, ICML.

[39]  K. Chou,et al.  Signal-3L: A 3-layer approach for predicting signal peptides. , 2007, Biochemical and biophysical research communications.

[40]  Swetlana Nikolajewa,et al.  DiProDB: a database for dinucleotide properties , 2008, Nucleic Acids Res..

[41]  Kuo-Chen Chou,et al.  Ensemble classifier for protein fold pattern recognition , 2006, Bioinform..

[42]  P. Brown,et al.  Global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[43]  Zuhong Lu,et al.  Capturing Cryptosporidium. , 1996, Nucleic Acids Res..

[44]  K. Chou,et al.  iRSpot-TNCPseAAC: Identify Recombination Spots with Trinucleotide Composition and Pseudo Amino Acid Components , 2014, International journal of molecular sciences.

[45]  Ren Long,et al.  iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition , 2016, Bioinform..

[46]  Mehmet Fatih Akay,et al.  Support vector machines combined with feature selection for breast cancer diagnosis , 2009, Expert Syst. Appl..

[47]  K. Chou,et al.  Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms , 2008, Nature Protocols.

[48]  K. Chou,et al.  Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. , 2015, Molecular bioSystems.