A sequence-based approach for identifying recombination spots in Saccharomyces cerevisiae by using hyper-parameter optimization in FastText and support vector machine

Abstract Meiotic recombination is a biological process that plays a crucial role in genetic evolution. The ability of machine learning models to extract the desired information embedded in DNA sequences has therefore drawn a great deal of attention among biologists. Several recent attempts have been made to address this problem; however, their performance still needs to be improved. The current study investigates how a natural language processing model can be combined with supervised learning to classify DNA sequences. The idea is to represent DNA sequences with the FastText model, including sub-word information, and then use the resulting representations as features in a suitable supervised learning algorithm. In the end, this hybrid approach classifies DNA recombination spots with a sensitivity of 90%, specificity of 94.76%, accuracy of 92.6%, and MCC of 0.851. These results suggest that the proposed method is superior to other methods on the same benchmark dataset. This study could therefore shed light on the development of prediction models for recombination spots in particular and DNA sequences in general.
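As a rough illustration of the ideas named in the abstract, the sketch below shows how a DNA sequence might be split into overlapping k-mer "words", how FastText-style character n-grams with `<` and `>` boundary markers can be derived from one word, and how sensitivity, specificity, accuracy, and MCC follow from a confusion matrix. The function names, the choice of k = 3, the n-gram lengths, and the toy counts are assumptions made for illustration only; they are not the settings or results of the study, and the actual embedding training and classifier are omitted.

```python
from math import sqrt


def kmer_words(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (hypothetical k=3)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def char_ngrams(word, minn=2, maxn=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers.

    This mirrors the sub-word idea used by FastText; the n-gram lengths
    here are illustrative assumptions, not the study's hyper-parameters.
    """
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    words = kmer_words("ACGTAC")      # toy sequence, not from the dataset
    print(words)
    print(char_ngrams(words[0]))
    # Toy confusion counts chosen for illustration, not the paper's results.
    print(binary_metrics(tp=8, fn=2, tn=9, fp=1))
```

In a full pipeline, the k-mer words would be fed to a FastText model to obtain sequence vectors, and those vectors would serve as input features to a classifier such as a support vector machine.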
