Predicting transcription factor binding sites by dual-stream multiple instance learning network

The discovery of transcription factor binding sites (TFBSs) is important for modeling potential binding mechanisms and subsequent cellular functions. In recent years, many deep learning methods have achieved good results in predicting TFBSs. However, these methods usually follow a fully supervised approach and ignore the weakly supervised information in DNA sequences. In contrast, existing multiple instance learning (MIL) methods based on weakly supervised learning usually divide the DNA sequence into multiple overlapping subsequences and model each instance separately. These methods do not take into account the connections between overlapping subsequences, and dividing a sequence in this way destroys its global information. In addition, deep learning methods generally perform poorly when training data are scarce. We therefore propose a new deep learning method, DS-SSB, which combines a dual-stream multiple instance network with multiple features. First, we combine sequence features and shape features after feature extraction at the instance level to enhance the feature representation of instances. Then, the instance embeddings are aggregated into a bag embedding through the dual-stream multiple instance network, and the relationships between the instances are considered in the aggregation process. Finally, the resulting bag features are fused with the BERT features of the whole sequence at the bag level for the final prediction. Experiments conducted on 690 ChIP-seq datasets showed that DS-SSB achieved good performance in predicting TFBSs. Experiments on four datasets further show that our method also has an advantage on small datasets.
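
As a rough illustration of the two MIL steps described above, the following sketch splits a DNA sequence into overlapping subsequences (instances) and then aggregates instance embeddings into a bag embedding with a dual-stream scheme: one stream picks the highest-scoring (critical) instance, the other weights every instance by its similarity to that critical instance. This is not the DS-SSB implementation. The function names, the window and stride values, the toy embeddings and scores, and the use of raw embeddings in place of learned query and value projections are all assumptions made for illustration.

```python
import math


def split_into_instances(seq, window, stride):
    """Slide a fixed-size window over seq; each window is one MIL instance."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(seq) < window:
        return [seq]  # sequence shorter than one window: a single instance
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dual_stream_aggregate(embeddings, scores):
    """Aggregate instance embeddings into one bag embedding.

    Stream 1: max pooling over instance scores selects the critical instance.
    Stream 2: each instance is weighted by the softmax of its similarity to
    the critical instance, so relationships between instances shape the bag.
    Simplification (assumption): raw embeddings stand in for learned
    query/value projections.
    """
    critical = max(range(len(scores)), key=scores.__getitem__)
    similarities = [dot(e, embeddings[critical]) for e in embeddings]
    weights = softmax(similarities)
    dim = len(embeddings[0])
    bag = [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
    return critical, weights, bag


# Toy usage: hypothetical window/stride and made-up embeddings and scores.
instances = split_into_instances("ACGTACGT", window=4, stride=2)
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # one vector per instance
scores = [0.2, 0.9, 0.1]                           # per-instance binding scores
critical, weights, bag = dual_stream_aggregate(embeddings, scores)
```

In a full model the instance embeddings would come from the fused sequence and shape features, and the bag embedding would be combined with a whole-sequence representation before classification; the sketch covers only the splitting and aggregation steps.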
