Application of Machine Learning Algorithms to Identify Recombination Spots

Meiotic recombination is a mechanism by which a cell promotes correct segregation of homologous chromosomes and repair of DNA damage. It does not occur uniformly across the genome: regions with relatively high recombination frequencies are called hotspots, and regions with relatively low frequencies are called cold spots. Accurate prediction of hot and cold spots is still an open challenge. Computational methods for recombination hotspot prediction often use sophisticated features derived from physicochemical or structure-based properties of nucleotides. In this work, we show the use of sequence-based features, which are computationally cheaper to generate. We took a DNA data set consisting of sequence strings, rearranged it to simplify processing, and represented it with gapped k-mer composition. We then applied different machine learning algorithms to the data set to predict hot and cold spots. We tested our approach on a standard benchmark data set using 5-fold and 10-fold cross-validation. Our analysis shows that, compared to other methods, our work is able to produce significantly better results in terms of accuracy. For 5-fold cross-validation, the support vector machine (SVM) gives the best sensitivity among all the algorithms, 0.7707. For 10-fold cross-validation, logistic regression (LR) and the artificial neural network (ANN) both give the best sensitivity, 0.7622, while SVM is close behind at 0.7601.
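As a rough illustration of the kind of sequence-based feature described above, the following Python sketch computes a gapped dinucleotide composition. The abstract does not give the exact feature definition, so everything here is an assumption made for illustration: the function name `gapped_dinucleotide_composition`, the `max_gap` parameter, and the choice to count pairs of nucleotides separated by 0 to `max_gap` positions and normalize the counts within each gap.

```python
from itertools import product

ALPHABET = "ACGT"


def gapped_dinucleotide_composition(seq, max_gap=2):
    """Return normalized frequencies of nucleotide pairs separated by a gap.

    Hypothetical definition (not taken from the paper): for each gap g in
    0..max_gap, count every pair (seq[i], seq[i + g + 1]) and divide by the
    number of valid pairs for that gap. The result has 16 * (max_gap + 1)
    entries, ordered by gap and then alphabetically by pair.
    """
    seq = seq.upper()
    features = []
    for gap in range(max_gap + 1):
        counts = {a + b: 0 for a, b in product(ALPHABET, repeat=2)}
        total = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            # Skip pairs containing ambiguous symbols such as 'N'.
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(
            counts[p] / total if total else 0.0 for p in sorted(counts)
        )
    return features


if __name__ == "__main__":
    vector = gapped_dinucleotide_composition("ACGTACGTTAGC", max_gap=2)
    print(len(vector))  # 16 pairs for each of 3 gap values
```

Each sequence is mapped to a fixed-length numeric vector regardless of its length, which is what allows classifiers such as SVM, LR, or ANN to be trained on it directly.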
