Read clouds uncover variation in complex regions of the human genome.

Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a method that uses read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), uses a Markov Random Field to capture the relationships among the short reads that are governed by the long read process. We used a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants in the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and in an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies.
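The intuition behind read clouds can be illustrated with a toy sketch. This is not RFA and does not use a Markov Random Field; it is a deliberately simplified heuristic, assuming that reads in one cloud originate from a single long fragment, so that a read with several candidate alignments can be placed near the uniquely mapped reads of the same cloud. The function name `resolve_cloud`, the input layout, and the `max_fragment` window of 10 kb are all hypothetical choices made for illustration.

```python
from collections import Counter
from statistics import median


def resolve_cloud(candidates, max_fragment=10_000):
    """Toy placement of multi-mapped reads within one read cloud.

    candidates: dict mapping read_id -> list of (chrom, pos) candidate
    alignments for reads assumed to come from the same long fragment.
    Returns a dict read_id -> (chrom, pos) for reads that end up with
    exactly one candidate near the cloud's uniquely mapped reads.
    """
    # Reads with a single candidate alignment act as anchors for the cloud.
    anchors = [cands[0] for cands in candidates.values() if len(cands) == 1]
    if not anchors:
        return {}  # nothing to anchor on; leave every read unresolved

    # Assumption: the fragment lies on the chromosome most anchors map to.
    chrom = Counter(a[0] for a in anchors).most_common(1)[0][0]
    center = median(a[1] for a in anchors if a[0] == chrom)

    resolved = {}
    for read_id, cands in candidates.items():
        # Keep only candidates within the assumed fragment window.
        near = [c for c in cands
                if c[0] == chrom and abs(c[1] - center) <= max_fragment]
        if len(near) == 1:
            resolved[read_id] = near[0]
    return resolved


cloud = {
    "r1": [("chr1", 1000)],
    "r2": [("chr1", 3000)],
    "r3": [("chr1", 2000), ("chr5", 70000)],  # repeat copy on another chromosome
    "r4": [("chr1", 2500), ("chr1", 2600)],   # still ambiguous inside the window
}
resolved = resolve_cloud(cloud)
```

In this example `r3` is placed on `chr1` because only one of its candidates lies near the anchors, while `r4` stays unresolved. A probabilistic model such as the one the abstract describes would instead weigh all reads in a cloud jointly rather than applying a hard window.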

[1]  E. Lander,et al.  Genomic mapping by fingerprinting random clones: a mathematical analysis. , 1988, Genomics.

[2]  P. Krapivsky Kinetics of random sequential parking on a line , 1992 .

[3]  B. Berger,et al.  Sequencing a genome by walking with clone-end sequences: a mathematical analysis. , 1999 .

[4]  B. Trask,et al.  Segmental duplications: organization and impact within the current human genome project assembly. , 2001, Genome research.

[5]  Tom H. Pringle,et al.  The human genome browser at UCSC. , 2002, Genome research.

[6]  M. Adams,et al.  Recent Segmental Duplications in the Human Genome , 2002, Science.

[7]  Thomas W. Mühleisen,et al.  Large recurrent microdeletions associated with schizophrenia , 2008, Nature.

[8]  Jessica R. Wolff,et al.  Microduplications of 16p11.2 are Associated with Schizophrenia , 2009, Nature Genetics.

[9]  Peter H. Sudmant,et al.  Diversity of Human Copy Number Variation and Multicopy Genes , 2010, Science.

[10]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[11]  Mariko Sasaki,et al.  Genome destabilization by homologous recombination in the germ line , 2010, Nature Reviews Molecular Cell Biology.

[12]  Gary D Bader,et al.  Functional impact of global rare copy number variation in autism spectrum disorders , 2010, Nature.

[13]  Alice McCarthy Third generation DNA sequencing: pacific biosciences' single molecule real time technology. , 2010, Chemistry & biology.

[14]  Tomas W. Fitzgerald,et al.  Origins and functional impact of copy number variation in the human genome , 2010, Nature.

[15]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[16]  Andrew C. Adey,et al.  Haplotype-resolved genome sequencing of a Gujarati Indian individual , 2011, Nature Biotechnology.

[17]  Z. Ou,et al.  Observation and prediction of recurrent human translocations mediated by NAHR between nonhomologous chromosomes. , 2011, Genome research.

[18]  R. Wilson,et al.  Modernizing Reference Genome Assemblies , 2011, PLoS biology.

[19]  Michael C. Schatz,et al.  Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score , 2012, Bioinform..

[20]  Bronwen L. Aken,et al.  GENCODE: The reference human genome annotation for The ENCODE Project , 2012, Genome research.

[21]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[22]  Jessica C. Ebert,et al.  Accurate whole genome sequencing and haplotyping from10-20 human cells , 2012, Nature.

[23]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[24]  Heng Li Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[25]  Timothy P. L. Smith,et al.  Reducing assembly complexity of microbial genomes with single-molecule sequencing , 2013, Genome Biology.

[26]  Aaron M. Newman,et al.  The genome sequence of the colonial chordate, Botryllus schlosseri , 2013, eLife.

[27]  Kali T. Witherspoon,et al.  Refining analyses of copy number variation identifies specific genes associated with developmental delay , 2014, Nature Genetics.

[28]  C. Nusbaum,et al.  Comprehensive variation discovery in single human genomes , 2014, Nature Genetics.

[29]  Peter H. Sudmant,et al.  Palindromic GOLGA8 core duplicons promote chromosome 15q13.3 microdeletion and evolutionary instability , 2014, Nature Genetics.

[30]  Mark J. P. Chaisson,et al.  Reconstructing complex regions of genomes using long-read sequencing technology , 2014, Genome research.

[31]  Andrew C. Adey,et al.  Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing , 2014, Nature Genetics.

[32]  Dmitry Pushkarev,et al.  Whole-genome haplotyping using long reads and statistical methods , 2014, Nature Biotechnology.

[33]  Rajiv C. McCoy,et al.  Illumina TruSeq Synthetic Long-Reads Empower De Novo Assembly and Resolve Complex, Highly-Repetitive Transposable Elements , 2014, bioRxiv.

[34]  Brian C. Thomas,et al.  Accurate, multi-kb reads resolve complex populations and detect rare microorganisms , 2015, Genome research.

[35]  P. Ashton,et al.  MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island , 2014, Nature Biotechnology.

[36]  Mark J. P. Chaisson,et al.  Resolving the complexity of the human genome using single-molecule sequencing , 2014, Nature.