iRecSpot-EF: Effective sequence based features for recombination hotspot prediction

In genetic evolution, meiotic recombination plays an important role. Recombination introduces genetic variations and is a vital source of biodiversity and appears as a driving force in evolutionary development. Local regions of chromosomes where recombination events tend to be concentrated are known as hotspots and regions with relatively low frequencies of recombination are called coldspots. Predicting hotspots and coldspots can enlighten structure of recombination and genome evolution. In this paper, we proposed a predictor, called iRecSpot-EF to predict recombination hot and cold spots. iRecSpot-EF uses a novel set of features extracted from the genome sequences. We introduce the frequency of (l,k,p)-mers in the sequence as features. Our proposed feature extraction method hinges solely upon the nucleotide sequences, thus being cost-effective and robust. After feature extraction, the most informative features are selected using AdaBoost algorithm. We have selected logistic regression as the classification algorithm. iRecSpot-EF was tested on a standard benchmark dataset using cross-fold validation. It achieved an accuracy of 95.14% and area under Receiver Operating Characteristic curve (auROC) of 0.985. The performance of iRecSpot-EF is significantly better than the state-of-the-art methods. iRecSpot-EF is readily available for use from http://iRecSpot.pythonanywhere.com/server. All relevant codes are available via open repository at: https://github.com/mrzResearchArena/iRecSpot.

[1]  Wei Chen,et al.  iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition , 2013, Nucleic acids research.

[2]  G. Coop,et al.  PRDM9 Is a Major Determinant of Meiotic Recombination Hotspots in Humans and Mice , 2010, Science.

[3]  Michael Krawczak,et al.  Translocation and gross deletion breakpoints in human inherited disease and cancer I: Nucleotide composition and recombination‐associated motifs , 2003, Human mutation.

[4]  Xiaolong Wang,et al.  repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects , 2015, Bioinform..

[5]  J. Ross Quinlan,et al.  Bagging, Boosting, and C4.5 , 1996, AAAI/IAAI, Vol. 1.

[6]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[7]  Jody Hey,et al.  What's So Hot about Recombination Hotspots? , 2004, PLoS biology.

[8]  Bin Liu,et al.  Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences , 2017 .

[9]  B. Liu,et al.  iRSpot-DACC: a computational predictor for recombination hot/cold spots identification based on dinucleotide-based auto-cross covariance , 2016, Scientific Reports.

[10]  Xiao Sun,et al.  Support vector machine for classification of meiotic recombination hotspots and coldspots in Saccharomyces cerevisiae based on codon composition , 2005, BMC Bioinformatics.

[11]  R Zhang,et al.  Z curves, an intutive tool for visualizing and analyzing the DNA sequences. , 1994, Journal of biomolecular structure & dynamics.

[12]  Trevor Hastie,et al.  Support Vector Machines , 2013 .

[13]  Ren Zhang,et al.  A Brief Review: The Z-curve Theory and its Application in Genome Analysis , 2014, Current genomics.

[14]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[15]  Andy Liaw,et al.  Classification and Regression by randomForest , 2007 .

[16]  Yong-qiang Xing,et al.  Using weighted features to predict recombination hotspots in Saccharomyces cerevisiae. , 2015, Journal of theoretical biology.

[17]  Wei Chen,et al.  The organization of nucleosomes around splice sites , 2010, Nucleic acids research.

[18]  L. Steinmetz,et al.  High-resolution mapping of meiotic crossovers and non-crossovers in yeast , 2008, Nature.

[19]  Maqsood Hayat,et al.  iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou’s PseAAC to formulate DNA samples , 2015, Molecular Genetics and Genomics.

[20]  M. Stephens,et al.  Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. , 2003, Genetics.

[21]  Zuhong Lu,et al.  Capturing Cryptosporidium. , 1996, Nucleic Acids Res..

[22]  K. Chou,et al.  iRSpot-TNCPseAAC: Identify Recombination Spots with Trinucleotide Composition and Pseudo Amino Acid Components , 2014, International journal of molecular sciences.

[23]  Ren Long,et al.  iRSpot-EL: identify recombination spots with an ensemble learning approach , 2017, Bioinform..

[24]  A. Izenman Linear Discriminant Analysis , 2013 .

[25]  Junjie Chen,et al.  Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences , 2015, Nucleic Acids Res..

[26]  J. Friedman Greedy function approximation: A gradient boosting machine. , 2001 .

[27]  Trevor Hastie,et al.  Multi-class AdaBoost ∗ , 2009 .

[28]  Jia Liu,et al.  Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. , 2012, Journal of theoretical biology.

[29]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[30]  Ron Kohavi,et al.  A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.

[31]  R Zhang,et al.  Analysis of distribution of bases in the coding sequences by a diagrammatic technique. , 1991, Nucleic acids research.

[32]  Abdollah Dehzangi,et al.  iDNAProt-ES: Identification of DNA-binding Proteins Using Evolutionary and Structural Features , 2017, Scientific Reports.

[33]  Yoav Freund,et al.  A Short Introduction to Boosting , 1999 .

[34]  A Grigoriev,et al.  Analyzing genomes with cumulative skew diagrams. , 1998, Nucleic acids research.

[35]  Irina Rish,et al.  An empirical study of the naive Bayes classifier , 2001 .

[36]  A. Jeffreys,et al.  Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex , 2001, Nature Genetics.

[37]  D. Cox The Regression Analysis of Binary Sequences , 1958 .