High-quality genome (re)assembly using chromosomal contact data

Closing gaps in draft genome assemblies can be costly and time-consuming, and published genomes are therefore often left ‘unfinished.’ Here we show that genome-wide chromosome conformation capture (3C) data can be used to overcome these limitations, and present a computational approach rooted in polymer physics that determines the most likely genome structure using chromosomal contact data. This algorithm, named GRAAL, generates high-quality assemblies of genomes in which repeated and duplicated regions are accurately represented, and offers a direct probabilistic interpretation of the computed structures. We first validated GRAAL on the reference genome of Saccharomyces cerevisiae, as well as other yeast isolates, where GRAAL recovered both known and unknown complex chromosomal structural variations. We then applied GRAAL to the finishing of the assembly of Trichoderma reesei and obtained a number of contigs congruent with the known karyotype of this species. Finally, we showed that GRAAL can accurately reconstruct human chromosomes from either fragments generated in silico or contigs obtained from de novo assembly. In all these applications, GRAAL compared favourably to recently published programmes implementing related approaches.
