Paper information - Benchmarking Reverse-Complement Strategies for Deep Learning Models in Genomics

Predictive models that map double-stranded regulatory DNA to molecular signals of regulatory activity should, in principle, produce identical predictions regardless of whether the forward-strand sequence or its reverse complement (RC) is supplied as input. Unfortunately, standard convolutional neural network architectures can produce highly divergent predictions across strands, even when the training set is augmented with RC sequences. Two strategies have emerged in the literature to enforce this symmetry: conjoined (also known as “siamese”) architectures, where the model is run in parallel on both strands and the predictions are combined, and RC parameter sharing (RCPS), where weight sharing ensures that the response of the model is equivariant across strands. However, systematic benchmarks are lacking, and neither architecture has been adapted to base-resolution signal profile prediction tasks. In this work, we extend conjoined and RCPS models to signal profile prediction and introduce a strong baseline: a standard model (trained on RC-augmented data) that is converted to a conjoined model only after it has been trained, which we call a “post-hoc” conjoined model. We then conduct benchmarks on both binary and signal profile prediction. We find that post-hoc conjoined models consistently perform as well as or better than models that were conjoined during training, and we present a mathematical intuition for why. We also find that, despite its theoretical appeal, RCPS performs surprisingly poorly on certain tasks, in particular signal profile prediction. In fact, RCPS can sometimes do worse than even standard models trained with RC data augmentation. We prove that RCPS models can represent the solution learned by the conjoined models, implying that the poor performance of RCPS may be due to optimization difficulties. We therefore suggest that users interested in RC symmetry should default to post-hoc conjoined models as a reliable baseline before exploring RCPS.
Code: https://github.com/hannahgz/BenchmarkRCStrategies
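
The post-hoc conjoined idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository above for that). It assumes a one-hot input of shape (length, 4) with channels ordered A, C, G, T, a single unstranded output track, and simple averaging as the way predictions are combined; the abstract says only that predictions are "combined". The names `reverse_complement`, `posthoc_conjoined`, and `toy_profile_model` are hypothetical.

```python
import numpy as np


def reverse_complement(onehot):
    """Reverse-complement a one-hot sequence of shape (length, 4).

    Assumes channels are ordered A, C, G, T, so reversing the channel
    axis swaps A<->T and C<->G, and reversing the position axis
    reverses the strand.
    """
    return onehot[::-1, ::-1]


def posthoc_conjoined(model, onehot, profile_output=False):
    """Wrap an already-trained model so its output is RC-symmetric.

    `model` is any callable mapping a (length, 4) array to a scalar or
    to a per-base profile of shape (length,). Averaging the two strand
    predictions is an assumption made for this sketch.
    """
    fwd = np.asarray(model(onehot))
    rev = np.asarray(model(reverse_complement(onehot)))
    if profile_output:
        # Per-base predictions for the RC input run in the opposite
        # direction, so flip them to align with the forward strand.
        rev = rev[::-1]
    return 0.5 * (fwd + rev)


def toy_profile_model(onehot):
    """Hypothetical stand-in for a trained model; deliberately not RC-symmetric."""
    base_weights = np.array([0.1, 0.2, 0.3, 0.4])
    position_bias = 0.01 * np.arange(onehot.shape[0])
    return onehot @ base_weights + position_bias


# Example: the sequence ACCTG as a one-hot array.
seq = np.eye(4)[[0, 1, 1, 3, 2]]
sym_fwd = posthoc_conjoined(toy_profile_model, seq, profile_output=True)
sym_rev = posthoc_conjoined(
    toy_profile_model, reverse_complement(seq), profile_output=True
)
# The wrapped profile for the RC input is the reversed forward profile.
print(np.allclose(sym_rev, sym_fwd[::-1]))
```

Because the wrapper is applied only at prediction time, the underlying model can be trained in the standard way with RC data augmentation, which matches the "post-hoc" conversion the abstract describes. A stranded profile output would additionally need its strand channels swapped, which this sketch does not cover.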
