DCDE: An Efficient Deep Convolutional Divergence Encoding Method for Human Promoter Recognition

Efficient feature extraction from human promoters remains a major challenge in genome analysis; better features would improve our understanding of human gene regulation and help guide experiments. Although many machine learning algorithms have been developed for eukaryotic gene recognition, their performance on promoters is unsatisfactory because promoter sequences are highly diverse. To extract discriminative features from human promoters, we propose an efficient deep convolutional divergence encoding method (DCDE) based on statistical divergence (SD) and convolutional neural networks (CNNs). SD helps optimize k-mer feature extraction for human promoters, while a CNN can automatically extract features in gene analysis. In DCDE, we first determine a set of informative k-mers with which to encode the original gene sequences. A series of SD methods is used to optimize the distributions of the most discriminative k-mers while retaining important positional information. A CNN then extracts lower-dimensional deep features through a secondary encoding. Finally, we construct a hybrid recognition architecture with multiple support vector machines and a bilayer decision method. The architecture makes it easy to add new features or new models, and it can be extended to identify other genomic functional elements. Extensive experiments demonstrate that DCDE is effective in promoter encoding and can significantly improve the performance of promoter recognition.
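
As an illustration of the first encoding stage, the sketch below ranks k-mers by their contribution to a statistical divergence between promoter and non-promoter k-mer distributions, then encodes a sequence position by position using the selected k-mers. It is a minimal sketch under stated assumptions, not the method as implemented in DCDE: the abstract does not say which divergences are used, so Kullback-Leibler divergence stands in as one common choice, and the function names, the pseudocount, and the integer positional encoding are all hypothetical.

```python
from collections import Counter
from itertools import product
from math import log

ALPHABET = "ACGT"


def kmer_distribution(seqs, k, pseudocount=1.0):
    """Smoothed k-mer frequency distribution over a set of sequences.

    The pseudocount (an assumption of this sketch) keeps every probability
    positive so that the log-ratio below is always defined.
    """
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # k-mers containing characters outside ACGT are ignored.
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def kl_contributions(p, q):
    """Per-k-mer terms p(m) * log(p(m) / q(m)) of the KL divergence D(p || q)."""
    return {m: p[m] * log(p[m] / q[m]) for m in p}


def select_informative_kmers(pos_seqs, neg_seqs, k, top_n):
    """Keep the top_n k-mers that contribute most to D(promoter || background)."""
    p = kmer_distribution(pos_seqs, k)
    q = kmer_distribution(neg_seqs, k)
    contrib = kl_contributions(p, q)
    return sorted(contrib, key=contrib.get, reverse=True)[:top_n]


def encode_positions(seq, selected, k):
    """Positional encoding: one integer per window, 0 if the k-mer was not selected.

    Keeping one symbol per position is a hypothetical way to retain
    positional information for a downstream convolutional model.
    """
    index = {m: j + 1 for j, m in enumerate(selected)}
    seq = seq.upper()
    return [index.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


# Toy data, invented for illustration only.
promoters = ["CGCGCGCG", "GCGCGCGC"]
background = ["ATATATAT", "TATATATA"]
selected = select_informative_kmers(promoters, background, k=2, top_n=2)
encoded = encode_positions("ACGCGT", selected, k=2)
```

On this toy data the two selected 2-mers are CG and GC, since they are frequent in the promoter set and rare in the background. The resulting integer sequence could then be one-hot encoded and passed to a CNN for the secondary encoding described above.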
