Strand-seq enables reliable separation of long reads by chromosome via expectation maximization

Motivation: Current sequencing technologies are able to produce reads orders of magnitude longer than was previously possible. Such long reads have sparked new interest in de novo genome assembly, which removes the reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with the latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately.

Results: We show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it makes it possible to assess, at the level of individual reads, the uncertainty inherent to sparse Strand-seq data. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Biosciences reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly.

Availability and implementation: https://github.com/daewoooo/SaaRclust
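
The idea of a per-read posterior over clusters can be illustrated with a minimal sketch. It is not the SaaRclust model itself: it assumes a simplified binomial mixture in which each cluster has one Watson-strand fraction per single cell, and every name in it (`e_step`, `m_step`, `em`, `init_p`, `eps`) is hypothetical. Each long read is represented by the (Watson, Crick) counts of Strand-seq reads aligned to it in each cell, and the starting parameters are supplied by the caller.

```python
import math

def e_step(counts, pi, p):
    """Posterior post[i][k] = P(cluster k | read i) under a binomial mixture.

    counts[i][j] = (watson, crick) counts for long read i in cell j.
    The binomial coefficient is dropped: it is identical for every cluster
    and cancels when the posterior is normalized.
    """
    post = []
    for read in counts:
        logs = []
        for k in range(len(pi)):
            ll = math.log(pi[k])
            for j, (w, c) in enumerate(read):
                ll += w * math.log(p[k][j]) + c * math.log(1.0 - p[k][j])
            logs.append(ll)
        top = max(logs)  # subtract the maximum for numerical stability
        weights = [math.exp(l - top) for l in logs]
        total = sum(weights)
        post.append([x / total for x in weights])
    return post

def m_step(counts, post, n_clusters, eps=1e-3):
    """Re-estimate cluster weights and per-cell Watson fractions."""
    n_reads, n_cells = len(counts), len(counts[0])
    pi = [max(sum(r[k] for r in post) / n_reads, eps) for k in range(n_clusters)]
    norm = sum(pi)
    pi = [x / norm for x in pi]
    p = []
    for k in range(n_clusters):
        row = []
        for j in range(n_cells):
            w_sum = sum(post[i][k] * counts[i][j][0] for i in range(n_reads))
            n_sum = sum(post[i][k] * sum(counts[i][j]) for i in range(n_reads))
            frac = (w_sum + eps) / (n_sum + 2 * eps)
            row.append(min(max(frac, eps), 1.0 - eps))  # keep logs finite
        p.append(row)
    return pi, p

def em(counts, init_p, n_iter=20):
    """Alternate E and M steps; init_p[k][j] is the caller's starting guess."""
    n_clusters = len(init_p)
    pi = [1.0 / n_clusters] * n_clusters
    p = [list(row) for row in init_p]
    for _ in range(n_iter):
        post = e_step(counts, pi, p)
        pi, p = m_step(counts, post, n_clusters)
    return e_step(counts, pi, p), pi, p
```

Calling `em(counts, init_p)` returns the soft assignments together with the fitted parameters. A read whose largest posterior stays well below 1 would be one that sparse counts cannot place confidently, which is the kind of per-read uncertainty the abstract refers to. Cells with no aligned reads, `(0, 0)`, contribute nothing to the likelihood in this sketch.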
