De Novo Assembly of Complete Chloroplast Genomes from Non-model Species Based on a K-mer Frequency-Based Selection of Chloroplast Reads from Total DNA Sequences

Whole genome shotgun (WGS) sequences of plant species often contain an abundance of reads derived from the chloroplast genome. Until now, these reads have generally been identified and assembled into chloroplast genomes based on homology to chloroplasts from related species. This re-sequencing approach may select against structural differences between the genomes, especially in non-model species for which no close relatives have been sequenced. The alternative is to assemble the chloroplast genome de novo from total genomic DNA sequences. In this study, we used k-mer frequency tables to identify and extract the chloroplast reads from the WGS reads, and assembled them using a highly integrated and automated custom pipeline. Our strategy includes steps aimed at optimizing assemblies and filling gaps that are left due to coverage variation in the WGS dataset. To demonstrate the universality of our approach, we successfully de novo assembled three complete chloroplast genomes from plant species with a range of nuclear genome sizes: Solanum lycopersicum (0.9 Gb), Aegilops tauschii (4 Gb) and Paphiopedilum henryanum (25 Gb). We also highlight the need to optimize the choice of k and the amount of data used. This new and cost-effective method for de novo short-read assembly will facilitate the study of complete chloroplast genomes with more accurate analyses and inferences, especially in non-model plant genomes.
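The idea of selecting chloroplast reads from a k-mer frequency table can be illustrated with a minimal sketch. It is not the pipeline used in the study. It assumes that chloroplast-derived k-mers occur far more often in a WGS dataset than nuclear k-mers, because the chloroplast genome is present in many copies per cell, and it keeps a read when the median count of its k-mers falls inside a chosen window. The function names, the median-based rule and the `min_count`/`max_count` thresholds are all hypothetical choices made for illustration. Reverse-complement (canonical) k-mers and sequencing errors are ignored for brevity.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-length substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_table(reads, k):
    """Count how often each k-mer occurs across all reads."""
    table = Counter()
    for read in reads:
        table.update(kmers(read, k))
    return table


def select_high_frequency_reads(reads, k, min_count, max_count=None):
    """Keep reads whose median k-mer count lies in [min_count, max_count].

    The thresholds are assumptions: in practice they would be chosen
    from the k-mer frequency distribution of the dataset at hand.
    """
    table = build_kmer_table(reads, k)
    selected = []
    for read in reads:
        counts = [table[km] for km in kmers(read, k)]
        if not counts:  # read shorter than k
            continue
        m = median(counts)
        if m >= min_count and (max_count is None or m <= max_count):
            selected.append(read)
    return selected


# Toy data: one "chloroplast-like" read seen five times, two unique reads.
cp_read = "ACGTTGCAAGGC"
reads = [cp_read] * 5 + ["TTTTTCCCCCGG", "GATAGATCCATG"]
chloroplast_like = select_high_frequency_reads(reads, k=5, min_count=3)
```

On this toy input only the five copies of the repeated read pass the filter. With real data, both k and the count window would need tuning, which matches the abstract's remark that the choice of k and the amount of data should be optimized. An upper bound such as `max_count` is one way to exclude other very abundant sequences, but whether that is needed depends on the dataset.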
