Expectation pooling: an effective and interpretable pooling method for predicting DNA–protein binding

Abstract Motivation Convolutional neural networks (CNNs) have outperformed conventional methods in modeling the sequence specificity of DNA–protein binding. While previous studies have built a connection between CNNs and probabilistic models, simple models of CNNs cannot achieve sufficient accuracy on this problem. Recently, some methods of neural networks have increased performance using complex neural networks whose results cannot be directly interpreted. However, it is difficult to combine probabilistic models and CNNs effectively to improve DNA–protein binding predictions. Results In this article, we present a novel global pooling method: expectation pooling for predicting DNA–protein binding. Our pooling method stems naturally from the expectation maximization algorithm, and its benefits can be interpreted both statistically and via deep learning theory. Through experiments, we demonstrate that our pooling method improves the prediction performance DNA–protein binding. Our interpretable pooling method combines probabilistic ideas with global pooling by taking the expectations of inputs without increasing the number of parameters. We also analyze the hyperparameters in our method and propose optional structures to help fit different datasets. We explore how to effectively utilize these novel pooling methods and show that combining statistical methods with deep learning is highly beneficial, which is promising and meaningful for future studies in this field. Availability and implementation All code is public in https://github.com/gao-lab/ePooling. Supplementary information Supplementary data are available at Bioinformatics online.

[1]  Rob Fergus,et al.  Stochastic Pooling for Regularization of Deep Convolutional Neural Networks , 2013, ICLR.

[2]  Hong-Bin Shen,et al.  Predicting RNA‐protein binding sites and motifs through combining local and global deep convolutional neural networks , 2018, Bioinform..

[3]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[4]  Wesley De Neve,et al.  SpliceRover: interpretable convolutional neural networks for improved splice site prediction , 2018, Bioinform..

[5]  Meng Wang,et al.  An exact transformation for CNN kernel enables accurate sequence motif identification and leads to a potentially full probabilistic interpretation of CNN , 2019 .

[6]  Holger Karas,et al.  TRANSFAC: a database on transcription factors and their DNA binding sites , 1996, Nucleic Acids Res..

[7]  O. Troyanskaya,et al.  Predicting effects of noncoding variants with deep learning–based sequence model , 2015, Nature Methods.

[8]  Radomír Mech,et al.  Deep Multi-patch Aggregation Network for Image Style, Aesthetics, and Quality Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[9]  Gary D. Stormo,et al.  DNA binding sites: representation and discovery , 2000, Bioinform..

[10]  Zhuowen Tu,et al.  Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree , 2015, AISTATS.

[11]  Razvan Pascanu,et al.  Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks , 2013, ECML/PKDD.

[12]  Yann LeCun,et al.  What is the best multi-stage architecture for object recognition? , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[13]  Jeremy Buhler,et al.  Finding motifs using random projections , 2001, RECOMB.

[14]  David K. Gifford,et al.  Convolutional neural network architectures for predicting DNA–protein binding , 2016, Bioinform..

[15]  Junchi Yan,et al.  Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks , 2017, BMC Genomics.

[16]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[17]  Xi Li,et al.  Stacked Pooling: Improving Crowd Counting by Boosting Scale Invariance , 2018, ArXiv.

[18]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[19]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[20]  Zhen Cao,et al.  Simple tricks of convolutional neural network architectures improve DNA-protein binding prediction , 2018, Bioinform..

[21]  A. A. Reilly,et al.  An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences , 1990, Proteins.

[22]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[23]  Wilfred W. Li,et al.  MEME: discovering and analyzing DNA and protein sequence motifs , 2006, Nucleic Acids Res..

[24]  Mark Goadrich,et al.  The relationship between Precision-Recall and ROC curves , 2006, ICML.

[25]  H. L. Le Roy,et al.  Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Vol. IV , 1969 .

[26]  B. Frey,et al.  Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning , 2015, Nature Biotechnology.

[27]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  M. Huss,et al.  A primer on deep learning in genomics , 2018, Nature Genetics.

[29]  Marc'Aurelio Ranzato,et al.  Sparse Feature Learning for Deep Belief Networks , 2007, NIPS.

[30]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[31]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[32]  Uwe Ohler,et al.  SSMART: Sequence-structure motif identification for RNA-binding proteins , 2017, bioRxiv.

[33]  Yu Cheng,et al.  S3Pool: Pooling with Stochastic Spatial Sampling , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Lawrence D. Jackel,et al.  Handwritten Digit Recognition with a Back-Propagation Network , 1989, NIPS.

[35]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Researchers , 2007 .

[37]  Clifford A. Meyer,et al.  Model-based Analysis of ChIP-Seq (MACS) , 2008, Genome Biology.

[38]  Tapani Raiko,et al.  European conference on machine learning and knowledge discovery in databases , 2014 .

[39]  Shuicheng Yan,et al.  Task-Driven Feature Pooling for Image Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[40]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[41]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[42]  William Stafford Noble,et al.  Quantifying similarity between motifs , 2007, Genome Biology.

[43]  Jean Ponce,et al.  A Theoretical Analysis of Feature Pooling in Visual Recognition , 2010, ICML.

[44]  Davide Castelvecchi,et al.  Can we open the black box of AI? , 2016, Nature.

[45]  Neil A. Dodgson,et al.  Proceedings Ninth IEEE International Conference on Computer Vision , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.