The complex task of choosing a de novo assembly: Lessons from fungal genomes

Selecting parameter values for de novo genome assembly programs, or choosing the best de novo assembly among several runs obtained with different parameters or programs, can require complex decision-making. A key parameter that must be supplied to typical next-generation sequencing (NGS) assemblers is the k-mer length, i.e., the word size that determines which de Bruijn graph the program constructs and traverses. The topic of assembly selection criteria was recently revisited in the Assemblathon 2 study (Bradnam et al., 2013). Although that study delivered no clear message regarding optimal k-mer lengths, it showed with examples that it can be important to decide whether one is most interested in optimizing the sequences of protein-coding genes (the gene space) or the whole genome sequence, including the intergenic DNA, because what is best for one criterion may not be best for the other. In the present study, our aim was to better understand how assemblies of unicellular fungal genomes (which are typically intermediate in size and complexity between those of prokaryotes and metazoan eukaryotes) change as the k-mer length is varied over a wide range. We used two de novo assembly programs (SOAPdenovo2 and ABySS) and simple assembly metrics that also focused on success in assembling the gene space and repetitive elements. A recent increase in Illumina read length to around 150 bp allowed us to attempt de novo assemblies over a larger range of k-mer lengths, up to 127 bp. We applied these methods to Illumina paired-end read sets from fungal strains of Paracoccidioides brasiliensis and other species.
By visualizing the results in simple plots, we were able to track the effect of changing the k-mer length and the assembly program, and to demonstrate how such plots can readily reveal discontinuities or other unexpected behavior that assembly programs can exhibit in practice, especially when they are used in a traditional molecular microbiology laboratory with a 'genomics corner'. Here we propose and apply one component of a first-pass validation methodology for benchmarking and understanding fungal genome de novo assembly processes.
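
To make the role of the k-mer length concrete, the following is a minimal Python sketch of how the choice of k changes the de Bruijn graph built from the same sequence. It is not the procedure used by SOAPdenovo2 or ABySS, and it ignores reverse complements, sequencing errors and coverage. The toy sequence and the function names `de_bruijn_edges` and `branching_nodes` are illustrative assumptions, not taken from the study.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one successor (ambiguous extensions)."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Hypothetical toy sequence containing the 6-bp repeat "GCGTAC" twice.
toy_genome = "ATGCGTACGTTAGCGTACCA"

for k in (4, 7, 8):
    graph = de_bruijn_edges([toy_genome], k)
    print(k, len(graph), branching_nodes(graph))
```

In this toy case the graph has branching nodes while k - 1 is no longer than the repeat, and becomes a simple path once k - 1 exceeds the repeat length. Real read sets add a competing pressure, since larger k also lowers k-mer coverage, which is one reason a sweep over k can behave non-monotonically.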

[1]  Siu-Ming Yiu,et al.  IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth , 2012, Bioinform..

[2]  Berat Z. Haznedaroglu,et al.  Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms , 2012, BMC Bioinformatics.

[3]  Fred P. Brooks,et al.  The Mythical Man-Month , 1975, Reliable Software.

[4]  Martin Kollmar,et al.  A novel hybrid gene prediction method employing protein multiple sequence alignments , 2011, Bioinform..

[5]  Gregory Kucherov,et al.  Using Cascading Bloom Filters to Improve the Memory Usage for de Brujin Graphs , 2013, WABI.

[6]  Rayan Chikhi,et al.  Space-Efficient and Exact de Bruijn Graph Representation Based on a Bloom Filter , 2012, WABI.

[7]  György Abrusán,et al.  The Distribution of L1 and Alu Retroelements in Relation to GC Content on Human Sex Chromosomes Is Consistent with the Ectopic Recombination Model , 2006, Journal of Molecular Evolution.

[8]  Paul Medvedev,et al.  Maximum Likelihood Genome Assembly , 2009, J. Comput. Biol..

[9]  Fred B. Schneider,et al.  A Theory of Graphs , 1993 .

[10]  J. McEwen,et al.  Limits to Sequencing and de novo Assembly: Classic Benchmark Sequences for Optimizing Fungal NGS Designs , 2014 .

[11]  J. Oliver,et al.  Sequence Compositional Complexity of DNA through an Entropic Segmentation Method , 1998 .

[12]  I. Dworkin,et al.  A pipeline for the de novo assembly of the Themira biloba (Sepsidae: Diptera) transcriptome using a multiple k-mer length approach , 2014, BMC Genomics.

[13]  A. Gnirke,et al.  ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads , 2009, Genome Biology.

[14]  Jr. Frederick P. Brooks,et al.  The mythical man-month (anniversary ed.) , 1995 .

[15]  Wentian Li The complexity of DNA , 1997 .

[16]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[17]  Keith Bradnam,et al.  Assessing the gene space in draft genomes , 2008, Nucleic acids research.

[18]  B. Barrell,et al.  Life with 6000 Genes , 1996, Science.

[19]  R. Britten,et al.  Repeated Sequences in DNA , 1968 .

[20]  W Li,et al.  Compositional heterogeneity within, and uniformity between, DNA sequences of yeast chromosomes. , 1998, Genome research.

[21]  Keith Bradnam,et al.  CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes , 2007, Bioinform..

[22]  G. Bernardi,et al.  The gene distribution of the maize genome. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[23]  Inanç Birol,et al.  Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species , 2013, GigaScience.

[24]  Gregory Kucherov,et al.  Using cascading Bloom filters to improve the memory usage for de Brujin graphs , 2013, Algorithms for Molecular Biology.

[25]  F. Foury,et al.  Human genetic diseases: a cross-talk between man and yeast. , 1997, Gene.

[26]  Douglas E. Bassett,et al.  Yeast genes and human disease , 1996, Nature.

[27]  W. Schaffner,et al.  Proto-Oncogenes, Unlike Harmless Genes, Tend to Be Dispersed in the Human Genome: Selection Against Out-of-Register Recombination? , 1999, Biological chemistry.

[28]  M. Berriman,et al.  A comprehensive evaluation of assembly scaffolding tools , 2014, Genome Biology.

[29]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[30]  Rayan Chikhi,et al.  Space-efficient and exact de Bruijn graph representation based on a Bloom filter , 2012, Algorithms for Molecular Biology.

[31]  Eric H. Davidson,et al.  Gene activity in early development , 1968 .

[32]  Sergey I. Nikolenko,et al.  SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing , 2012, J. Comput. Biol..

[33]  Ron Shamir,et al.  Design of shortest double-stranded DNA sequences covering all k-mers with applications to protein-binding microarrays and synthetic enhancers , 2013, Bioinform..

[34]  M. Drummond,et al.  Health Care Technology: Effectiveness, Efficiency and Public Policy / Methods for the Economic Evaluation of Health Care Programmes , 1988 .

[35]  Pedro Miramontes,et al.  Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome , 2013, BMC Bioinformatics.

[36]  Paul Medvedev,et al.  Informed and automated k-mer size selection for genome assembly , 2013, Bioinform..

[37]  Steven J. M. Jones,et al.  ABySS: a parallel assembler for short read sequence data. , 2009, Genome research.

[38]  S. Wilson Methods for the economic evaluation of health care programmes , 1987 .

[39]  Wentian Li The Measure of Compositional Heterogeneity in DNA Sequences Is Related to Measures of Complexity , 1997, adap-org/9709007.

[40]  Elizabeth Pennisi Genomics. DNA sequencers still waiting for the nanopore revolution. , 2014, Science.

[41]  Pavel A Pevzner,et al.  How to apply de Bruijn graphs to genome assembly. , 2011, Nature biotechnology.

[42]  Giorgio Bernardi,et al.  Structural and evolutionary genomics : natural selection in genome evolution , 2004 .

[43]  Yaokun Wu,et al.  De Bruijn digraphs and affine transformations , 2005, Eur. J. Comb..

[44]  A S Fraenkel,et al.  Proof that sequences of A,C,G, and T can be assembled to produce chains of ultimate length avoiding repetitions everywhere. , 1966, Progress in nucleic acid research and molecular biology.

[45]  Christina A. Cuomo,et al.  Comparative Genomic Analysis of Human Fungal Pathogens Causing Paracoccidioidomycosis , 2011, PLoS genetics.

[46]  M. Schatz,et al.  GAGE: A critical evaluation of genome assemblies and assembly algorithms. , 2012, Genome research.

[47]  J. Montoya-Burgos,et al.  Optimization of de novo transcriptome assembly from next-generation sequencing data. , 2010, Genome research.

[48]  Jian Wang,et al.  SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler , 2012, GigaScience.