Adaptive Somatic Mutations Calls with Deep Learning and Semi-Simulated Data

A number of approaches have been developed to call somatic variation in high-throughput sequencing data. Here, we present an adaptive approach to calling somatic variations. Our approach trains a deep feed-forward neural network with semi-simulated data. Semi-simulated datasets are constructed by planting somatic mutations in real datasets where no mutations are expected. Using semi-simulated data makes it possible to train the models with millions of training examples, a usual requirement for successfully training deep learning models. We initially focus on calling variations in RNA-Seq data. We derive semi-simulated datasets from real RNA-Seq data, which offer a good representation of the data the models will be applied to. We test the models on independent semi-simulated data as well as pure simulations. On independent semi-simulated data, models achieve an AUC of 0.973. When tested on semi-simulated exome DNA datasets, we find that the models trained on RNA-Seq data remain predictive (sensitivity 0.4 and specificity 0.9 at a cutoff of P ≥ 0.9), albeit with lower overall performance (AUC = 0.737). Interestingly, while the models generalize across assays, training on RNA-Seq data lowers the confidence for a group of mutations. Haloplex exome-specific training was also performed, demonstrating that the approach can produce probabilistic models tuned for specific assays and protocols. We found that the method adapts to the characteristics of the experimental protocol. We further illustrate these points by training a model for a trio somatic experimental design, in which germline DNA of both parents is available in addition to data about the individual. These models are distributed with Goby (http://goby.campagnelab.org).
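
The idea of planting mutations in real data can be illustrated with a minimal sketch. It is not the Goby implementation: the per-site base-count representation, the function names `plant_mutation` and `make_examples`, the binomial reassignment of reference reads, and the allele fraction of 0.3 are all assumptions made for illustration only.

```python
import random

BASES = "ACGT"

def plant_mutation(counts, ref_base, allele_fraction, rng):
    """Return (mutated_counts, alt_base) for one genomic site.

    Hypothetical scheme: each read supporting the reference base is
    reassigned to a randomly chosen alternate base with probability
    `allele_fraction`. The total read depth is left unchanged.
    """
    alt_base = rng.choice([b for b in BASES if b != ref_base])
    ref_reads = counts.get(ref_base, 0)
    moved = sum(1 for _ in range(ref_reads) if rng.random() < allele_fraction)
    mutated = dict(counts)  # copy, so the real observation stays intact
    mutated[ref_base] = ref_reads - moved
    mutated[alt_base] = mutated.get(alt_base, 0) + moved
    return mutated, alt_base

def make_examples(sites, rng, allele_fraction=0.3):
    """Build labeled examples: real sites get label 0, planted copies label 1."""
    examples = []
    for counts, ref_base in sites:
        examples.append((counts, 0))
        mutated, _ = plant_mutation(counts, ref_base, allele_fraction, rng)
        examples.append((mutated, 1))
    return examples

if __name__ == "__main__":
    rng = random.Random(0)
    site = ({"A": 100, "C": 0, "G": 1, "T": 0}, "A")
    for counts, label in make_examples([site], rng):
        print(label, counts)
```

Because every real site yields both an unmutated and a planted example, the number of labeled training examples grows with the size of the real dataset, which is what makes training sets of millions of examples practical.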
