Evaluation of Convolutional Neural Network Modeling of DNA Sequences Using Ordinal versus One-Hot Encoding Methods

Convolutional neural networks (CNNs) are a popular choice for supervised DNA motif prediction because of their excellent performance. To use a CNN, the input DNA sequences must be encoded as numerical values and represented as either vectors or multi-dimensional matrices. This paper evaluates a simple and more compact ordinal encoding method against the popular one-hot encoding for DNA sequences. We compared the performance of both encoding methods using three sets of datasets enriched with DNA motifs. We found that ordinal encoding performs comparably to one-hot encoding but with a significant reduction in training time. In addition, the performance of one-hot encoding was rather consistent across datasets, but it required a suitable CNN configuration to perform well. Ordinal encoding with matrix representation performed best on some of the evaluated datasets. This study implies that the performance of CNNs for DNA motif discovery depends on a suitable design of the sequence encoding and representation. The good performance of the ordinal encoding method demonstrates that there is still room for improvement in the one-hot encoding method.
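To make the two encodings concrete, the following minimal Python sketch encodes a short DNA sequence both ways and reshapes the ordinal vector into a matrix. The numeric values assigned to each base, the helper names (`ordinal_encode`, `one_hot_encode`, `to_matrix`), and the matrix width are assumptions chosen for illustration; the abstract does not specify them.

```python
# Illustrative sketch only: the value assigned to each base is an assumption,
# not necessarily the mapping used in the paper.
ORDINAL = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.00}
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def ordinal_encode(seq):
    """Encode each base as a single number (one value per base)."""
    return [ORDINAL[base] for base in seq.upper()]


def one_hot_encode(seq):
    """Encode each base as a 4-element indicator vector (four values per base)."""
    return [list(ONE_HOT[base]) for base in seq.upper()]


def to_matrix(values, n_cols):
    """Reshape a flat vector into rows of n_cols values (hypothetical layout)."""
    if len(values) % n_cols != 0:
        raise ValueError("vector length must be a multiple of n_cols")
    return [values[i:i + n_cols] for i in range(0, len(values), n_cols)]


seq = "ACGTTGCA"
ordinal = ordinal_encode(seq)      # 8 values for 8 bases
one_hot = one_hot_encode(seq)      # 8 x 4 = 32 values for 8 bases
matrix = to_matrix(ordinal, 4)     # 2 x 4 matrix representation

print(len(ordinal), sum(len(row) for row in one_hot))
```

For a sequence of length n, the ordinal vector holds n values while the one-hot representation holds 4n, which is consistent with the more compact input and the shorter training time reported above.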
