Prediction of Recombination Spots Using Novel Hybrid Feature Extraction Method via Deep Learning Approach

Meiotic recombination is the driving force of evolutionary development and an important source of genetic variation. It does not take place randomly along a chromosome but is concentrated in particular regions. Regions with a higher rate of meiotic recombination events are called hotspots, and regions where recombination events are less frequent are called coldspots. Predicting meiotic recombination spots provides useful information about the basic mechanisms of inheritance and genome diversity. This study proposes an intelligent computational predictor, iRSpots-DNN, for the identification of recombination spots. The proposed predictor is based on a novel feature extraction method and an optimized deep neural network (DNN). The DNN was employed as the classification engine, whereas the novel feature extraction method was developed to extract meaningful features for identifying hotspots and coldspots across the yeast genome. Unlike previous algorithms, the proposed feature extraction avoids bias among the selected features and preserves the sequence discriminant properties together with the sequence-structure information. This study also evaluated three other classifiers, support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF), for predicting recombination spots. Experimental results on a benchmark dataset with 10-fold cross-validation showed that iRSpots-DNN achieved the highest accuracy, 95.81%. Additionally, iRSpots-DNN performed significantly better than existing predictors on a benchmark dataset. The benchmark dataset and source code are freely available at: https://github.com/Fatima-Khan12/iRspot_DNN/tree/master/iRspot_DNN.
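
To make the evaluation setup described above more concrete, the following is a minimal sketch of a sequence-to-feature step and a 10-fold split. It is not the hybrid feature extraction method of iRSpots-DNN, whose details are not given here; plain k-mer composition is used only as a generic stand-in, and the function names `kmer_composition`, `k_fold_indices`, and `accuracy` are hypothetical.

```python
from itertools import product


def kmer_composition(seq, k=2):
    """Return normalized k-mer frequencies of a DNA sequence.

    Illustrative stand-in feature only (NOT the paper's hybrid method).
    The vector is ordered lexicographically over the alphabet ACGT.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases (e.g. N)
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

In such a setup, each hotspot or coldspot sequence would be turned into a feature vector, a classifier would be trained on the training indices of each fold, and the per-fold accuracies would be averaged. In practice the samples would usually be shuffled (and ideally stratified by class) before splitting.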
