Enhanced prediction of recombination hotspots using input features extracted by class specific autoencoders.

In yeast and in some mammals, recombination frequencies are high at certain genomic locations, known as recombination hotspots, while locations where recombination is below average are known as coldspots. Knowledge of hotspot regions provides clues for understanding the meiotic process and the possible effects of sequence variation in these regions. Moreover, accurate information about hotspot and coldspot regions can yield insights into genome evolution. In the present work, we used class-specific autoencoders for feature extraction and reduction. The deep features extracted by the autoencoders were then used to train three different classifiers (gradient boosting machines, random forests, and deep learning neural networks) to predict hotspot and coldspot regions. To select the best set of deep features, we carried out a comparative performance analysis on deep features extracted by autoencoders from different sets of the training data. The learning algorithms trained on features extracted by the combined class-specific autoencoder outperformed the same algorithms trained on the other sets of deep features. Combined class-specific autoencoder-based feature extraction can therefore be applied to a growing range of biological problems to achieve superior prediction performance.
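As a rough illustration of the feature-extraction idea described above, the following minimal sketch trains one small autoencoder per class and concatenates the hidden codes from both encoders into a single feature vector. It is not the implementation used in this work: the `TinyAutoencoder` class, the `combined_class_specific_features` helper, the network size, the learning rate, and the synthetic data are all assumptions made for illustration, and reading "combined" as concatenation of the per-class codes is likewise an assumption.

```python
import numpy as np


class TinyAutoencoder:
    """Hypothetical one-hidden-layer autoencoder: tanh encoder, linear decoder."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
        self.b2 = np.zeros(n_inputs)

    def encode(self, X):
        # Hidden code, used downstream as the "deep features"
        return np.tanh(X @ self.W1 + self.b1)

    def reconstruct(self, X):
        return self.encode(X) @ self.W2 + self.b2

    def fit(self, X, epochs=200, lr=0.05):
        # Plain batch gradient descent on mean squared reconstruction error
        n = X.shape[0]
        for _ in range(epochs):
            H = self.encode(X)
            err = (H @ self.W2 + self.b2) - X
            grad_W2 = H.T @ err / n
            grad_b2 = err.mean(axis=0)
            dH = (err @ self.W2.T) * (1.0 - H ** 2)  # backprop through tanh
            grad_W1 = X.T @ dH / n
            grad_b1 = dH.mean(axis=0)
            self.W1 -= lr * grad_W1
            self.b1 -= lr * grad_b1
            self.W2 -= lr * grad_W2
            self.b2 -= lr * grad_b2
        return self


def combined_class_specific_features(X_train, y_train, X, n_hidden=4):
    """Train one autoencoder per class, then encode X with every encoder
    and concatenate the codes (assumed meaning of 'combined')."""
    codes = []
    for label in np.unique(y_train):
        ae = TinyAutoencoder(X_train.shape[1], n_hidden, seed=int(label))
        ae.fit(X_train[y_train == label])  # sees only this class's samples
        codes.append(ae.encode(X))
    return np.hstack(codes)


# Synthetic stand-in data: 30 "hotspot" and 30 "coldspot" feature vectors
rng = np.random.default_rng(42)
X_hot = rng.random((30, 6)) + 0.5
X_cold = rng.random((30, 6))
X_train = np.vstack([X_hot, X_cold])
y_train = np.array([1] * 30 + [0] * 30)

deep_features = combined_class_specific_features(X_train, y_train, X_train)
print(deep_features.shape)  # (60, 8): 2 classes x 4 hidden units
```

The resulting `deep_features` matrix could then be passed to any of the classifiers named above. In practice the autoencoders would be fit on training data only and applied unchanged to held-out sequences.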
