A comparison and assessment of computational methods for identifying recombination hotspots in Saccharomyces cerevisiae

Meiotic recombination, which is initiated by DNA double-strand breaks, is one of the most important driving forces of biological evolution and plays a key role in genome diversity. This review first provides a comprehensive survey of the 15 computational methods developed for identifying recombination hotspots in Saccharomyces cerevisiae. These methods are discussed and compared in terms of underlying algorithms, extracted features, predictive capability and practical utility. Subsequently, a more objective benchmark data set was constructed and used to develop a new predictor, iRSpot-Pse6NC2.0 (http://lin-group.cn/server/iRSpot-Pse6NC2.0). To further assess the generalization ability of these methods, we compared iRSpot-Pse6NC2.0 with existing methods on chromosome XVI of S. cerevisiae. The results of this independent data set test demonstrated that the new predictor is superior to existing tools in identifying recombination hotspots. We expect iRSpot-Pse6NC2.0 to become an important tool for identifying recombination hotspots.
