ezGeno: An Automatic Model Selection Package for Genomic Data Analysis

To facilitate the process of tailor-making a deep neural network for exploring the dynamics of genomic DNA, we have developed a hands-on package called ezGeno that automates the search for hyperparameters and network structures. ezGeno considers three search spaces: the number of filters, the dilation factors, and the connectivity between layers. It can be applied to any kind of 1D genomic input, such as genomic sequences, histone modifications, and DNase feature data, and combinations of these 1D features can also be used as input. Specifically, for the task of predicting TF binding using genomic sequences as the input, ezGeno can consistently return the best-performing set of parameters and network structure, as well as highlight the important segments within the original sequences. For the task of predicting tissue-specific enhancer activity using both sequence and DNase feature data as the input, ezGeno also regularly outperforms the hand-designed models. In this study, we demonstrate that ezGeno is superior in efficiency and accuracy to AutoKeras, a general open-source AutoML package. The average AUC of ezGeno is also consistently higher than that of a one-layer DeepBind model. Given the flexibility of ezGeno, we expect that this package can provide future researchers not only with support for model design in their genomic analyses but also with more insight into the regulatory landscape.

Availability: The ezGeno package can be freely accessed at https://github.com/ailabstw/ezGeno.

Contact: Dr. Chien-Yu Chen, chienyuchen@ntu.edu.tw
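
As an illustration of the three search spaces named in the abstract (number of filters, dilation factors, and connectivity between layers), the following is a minimal sketch of how such a space could be represented and sampled. It is not ezGeno's implementation or API: the candidate values, the `LayerChoice` class, and both functions are hypothetical, and uniform random sampling stands in for whatever search strategy the package actually uses.

```python
import random
from dataclasses import dataclass

# Hypothetical candidate values; the abstract does not list the actual choices.
FILTER_CHOICES = (32, 64, 128)
DILATION_CHOICES = (1, 2, 4)


@dataclass(frozen=True)
class LayerChoice:
    num_filters: int  # number of convolution filters in this layer
    dilation: int     # dilation factor of the convolution
    input_from: int   # index of the earlier layer feeding this one; -1 is the raw input


def sample_architecture(num_layers, rng):
    """Draw one candidate network uniformly at random from the search space."""
    layers = []
    for i in range(num_layers):
        layers.append(
            LayerChoice(
                num_filters=rng.choice(FILTER_CHOICES),
                dilation=rng.choice(DILATION_CHOICES),
                # Layer i may connect to the raw input (-1) or any earlier layer (0..i-1).
                input_from=rng.randrange(-1, i),
            )
        )
    return layers


def search_space_size(num_layers):
    """Count the distinct architectures in this toy search space."""
    size = 1
    for i in range(num_layers):
        size *= len(FILTER_CHOICES) * len(DILATION_CHOICES) * (i + 1)
    return size


if __name__ == "__main__":
    rng = random.Random(0)
    for layer in sample_architecture(3, rng):
        print(layer)
    print("candidates for 3 layers:", search_space_size(3))
```

Even with only three options per dimension, the number of candidates grows quickly with depth, which is the motivation for automating the search rather than tuning each choice by hand.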
