Exploring Pandora's Box: Potential and Pitfalls of Low Coverage Genome Surveys for Evolutionary Biology

High-throughput sequencing technologies are revolutionizing genetic research. With this “rise of the machines”, genomic sequences can be obtained even for unknown genomes within a short time and at reasonable cost. This has enabled evolutionary biologists studying genetically unexplored species to identify molecular markers or genomic regions of interest (e.g. micro- and minisatellites, mitochondrial and nuclear genes) by sequencing only a fraction of the genome. However, in such datasets from non-model species, DNA from non-target contaminant species such as bacteria, viruses, fungi, or other eukaryotic organisms may complicate the interpretation of the results. In this study we analysed 14 genomic pyrosequencing libraries of aquatic non-model taxa from four major evolutionary lineages. We quantified the suitable micro- and minisatellites, mitochondrial genomes, known nuclear genes and transposable elements, and searched for contamination from various sources using bioinformatic approaches. Our results show that in all sequence libraries, with estimated coverage of about 0.02–25%, many suitable micro- and minisatellites, mitochondrial gene sequences and nuclear genes from different KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways could be identified and characterized. These can serve as markers for phylogenetic and population genetic analyses. A central finding of our study is that several genomic libraries suffered from different biases owing to non-target DNA or mobile elements. In particular, viruses, bacteria or eukaryote endosymbionts contributed significantly (up to 10%) to some of the libraries analysed. If not identified as such, genetic markers developed from high-throughput sequencing data for non-model organisms may bias evolutionary studies or fail completely in experimental tests.
In conclusion, our study demonstrates the enormous potential of low-coverage genome survey sequences and suggests bioinformatic analysis workflows. The results also call for more sophisticated filtering of problematic sequences and non-target genome sequences before markers are developed.
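
To illustrate the kind of scan involved in quantifying microsatellites in a read library, the following is a minimal Python sketch. It is not the workflow used in this study: the function name `find_microsatellites`, the unit lengths (2 to 6 bp) and the minimum repeat count (5) are assumptions chosen for illustration, and only perfect tandem repeats are detected.

```python
import re

# Hypothetical thresholds for illustration; not the parameters of the study.
MIN_UNIT, MAX_UNIT, MIN_REPEATS = 2, 6, 5

# A unit of MIN_UNIT..MAX_UNIT bases (shortest unit tried first), followed by
# at least MIN_REPEATS - 1 further perfect copies of that unit.
PATTERN = re.compile(
    r"([ACGT]{%d,%d}?)\1{%d,}" % (MIN_UNIT, MAX_UNIT, MIN_REPEATS - 1)
)


def find_microsatellites(read):
    """Return (start, unit, copies) for each perfect tandem repeat in a read."""
    hits = []
    for match in PATTERN.finditer(read.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    example_read = "TTG" + "AC" * 6 + "GGT" + "ACG" * 5 + "T"
    for start, unit, copies in find_microsatellites(example_read):
        print(start, unit, copies)
```

A scan of this kind would only make sense after reads from non-target genomes have been filtered out, because repeats found in contaminant reads would otherwise be counted as candidate markers for the target species.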
