Informed and automated k-mer size selection for genome assembly

MOTIVATION: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision.

RESULTS: We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies.

AVAILABILITY: Our tool KmerGenie is freely available at: http://kmergenie.bx.psu.edu/.
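The abstract does not spell out the sampling method, so the following is only a minimal sketch of one common way to approximate a k-mer abundance histogram by sampling, not KmerGenie's implementation. It assumes a hash-based 1-in-eps sample: a k-mer is kept only when its hash falls in one fixed bucket, every occurrence of a kept k-mer is counted exactly, and the resulting histogram is scaled up by eps. The function name `sampled_abundance_histogram`, the parameter `eps`, and the use of CRC32 as the hash are illustrative assumptions.

```python
import zlib
from collections import Counter


def sampled_abundance_histogram(reads, k, eps):
    """Approximate k-mer abundance histogram from a 1-in-eps hash sample.

    Hypothetical helper for illustration; not taken from KmerGenie.
    Returns a dict mapping abundance -> estimated number of distinct k-mers.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Keep a k-mer only if its hash lands in the sampled bucket.
            # The decision depends on the k-mer alone, so every occurrence
            # of a kept k-mer is counted and its abundance stays exact.
            if zlib.crc32(kmer.encode()) % eps == 0:
                counts[kmer] += 1

    # Each sampled k-mer stands in for roughly eps distinct k-mers,
    # so each one contributes eps to the bin of its abundance.
    histogram = Counter()
    for abundance in counts.values():
        histogram[abundance] += eps
    return dict(histogram)


if __name__ == "__main__":
    # With eps=1 nothing is discarded, so the histogram is exact.
    print(sampled_abundance_histogram(["ACGTACGT"], k=4, eps=1))
```

With eps set to 1 the sketch degenerates to exact counting, which makes it easy to check against a small input. Larger values of eps reduce memory and time roughly in proportion, at the cost of noisier bins. A real tool would also need to handle reverse complements and non-ACGT characters, which this sketch ignores.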
