iRSpot-SPI: Deep learning-based recombination spots prediction by incorporating secondary sequence information coupled with physio-chemical properties via Chou's 5-step rule and pseudo components

Abstract Meiotic recombination plays an important role in generating genetic diversity. Regions with a higher rate of meiotic recombination are called “hotspots”, while regions with a lower rate are called “cold spots”. Recombination strongly affects genome evolution through gene conversion or mutagenesis, and recent research indicates that it is unevenly distributed across the genome. Many computational methods have been developed to predict hotspots and cold spots using secondary sequence information or physiochemical properties of nucleotide descriptors. These methods are computationally cheaper and faster than wet-lab experiments, but they often ignore the correlations between nucleotide pairs at different positions along the DNA sequence, which carry important predictive information. In this study, we propose a deep neural network that predicts recombination spots by fusing secondary sequence information with physio-chemical derived features. Our deep learning algorithm leverages a deep dense architecture and outperforms state-of-the-art methods, with a classification accuracy of 90.04%, sensitivity of 92.21%, specificity of 92.11% and area under the curve of 0.9801. We anticipate that the model will provide novel insight for basic research, drug design, academic research, and recombination spot studies in particular. The methodology and Python-based source code are publicly available at https://github.com/zaheerkhancs/irSpot_SPI, along with a publicly accessible web server for the proposed predictor.
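For readers less familiar with the evaluation measures quoted above, the following is a minimal sketch of how accuracy, sensitivity and specificity are conventionally computed for a binary hotspot/cold spot classifier. It is not taken from the authors' repository: the function name `binary_metrics`, the label encoding (1 = hotspot, 0 = cold spot) and the toy labels are assumptions made only for illustration.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for binary labels.

    Assumed encoding (hypothetical, for illustration only):
    1 = hotspot (positive class), 0 = cold spot (negative class).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Report NaN instead of raising when a class is absent.
        return numerator / denominator if denominator else math.nan

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Toy labels, invented for this example (not data from the study).
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
toy_scores = binary_metrics(toy_true, toy_pred)
```

Sensitivity here is the fraction of true hotspots that are recovered, and specificity is the fraction of true cold spots that are correctly rejected. The area under the curve additionally needs the classifier's continuous scores, so it is left out of this sketch.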
