DNA sequence classification based on MLP with PILAE algorithm

In bioinformatics, classifying unknown biological sequences is a key task, fundamental to the consistent organization, aggregation, and survey of organisms and their evolution. Biological sequences can be viewed as high-dimensional data points whose dimension is not fixed but depends on the sequence length. Numerical encoding, for example one-hot encoding (OHE), plays an important role in the computational analysis of DNA sequences. However, OHE has two drawbacks: 1) it adds no information that could serve as an additional predictive variable, and 2) if the variable has many classes, OHE enlarges the feature space significantly. To overcome these drawbacks, this paper presents a computationally efficient framework for classifying DNA sequences of living organisms in the image domain. The proposed strategy relies on a multilayer perceptron (MLP) trained by the pseudoinverse learning autoencoder (PILAE) algorithm. PILAE training requires neither setting learning control parameters nor specifying the number of hidden layers. Therefore, the PILAE classifier can achieve better performance than other deep neural networks (DNNs) such as the VGG-16 and Xception models. Experimental results demonstrate that the proposed strategy achieves high prediction accuracy together with substantially higher computational efficiency across different datasets.
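The second OHE drawback can be illustrated with a minimal sketch. It is not the paper's preprocessing: the function name `one_hot_encode`, the choice of overlapping k-mers as the "classes", and the toy sequence are all assumptions made for illustration. The sketch only shows how the encoded length grows with the number of classes (4 for single nucleotides, 4^k for k-mers).

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def one_hot_encode(sequence, k=1):
    """One-hot encode a DNA sequence over overlapping k-mers.

    Returns a flat list of 0/1 values with one block of 4**k entries
    per k-mer position. Hypothetical helper, for illustration only.
    """
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocabulary)}
    encoded = []
    for start in range(len(sequence) - k + 1):
        block = [0] * len(vocabulary)
        # Raises KeyError on symbols outside ALPHABET (e.g. "N").
        block[index[sequence[start:start + k]]] = 1
        encoded.extend(block)
    return encoded


single = one_hot_encode("ACGTAC", k=1)  # 6 positions x 4 classes
triple = one_hot_encode("ACGTAC", k=3)  # 4 positions x 64 classes
print(len(single), len(triple))
```

Under these assumptions, a six-base sequence needs 24 features with single-nucleotide classes and 256 with 3-mer classes, while each block still carries only a single nonzero entry.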
