Single haplotype assembly of the human genome from a hydatidiform mole

An accurate and complete reference human genome sequence assembly is essential for interpreting individual genomes and associating sequence variation with disease phenotypes. While the current reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can help overcome these problems, even the longest available reads do not resolve all regions of the human genome. To overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources derived from this DNA, including a set of end-sequenced and indexed BAC clones, an optical map, and 100X whole genome shotgun (WGS) sequence coverage using short (Illumina) read pairs. We used the WGS sequence and the GRCh37 reference assembly to create a sequence assembly of the CHM1 genome. We subsequently incorporated 382 finished CHORI-17 BAC clone sequences to generate a second draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene and repeat content shows this assembly to be of excellent quality and contiguity, and comparisons to ClinVar and the NHGRI GWAS catalog show that the CHM1 genome does not harbor an excess of deleterious alleles. However, comparison to assembly-independent resources, such as BAC clone end sequences and long reads generated by a different sequencing technology (PacBio), indicates misassembled regions. The great majority of these regions are enriched for structural variation and segmental duplication and can be resolved in the future by sequencing BAC clone tiling paths. This publicly available first-generation assembly will be integrated into the Genome Reference Consortium (GRC) curation framework for further improvement, with the ultimate goal of a completely finished, gap-free assembly.
