A field guide to whole-genome sequencing, assembly and annotation

Genome sequencing projects were long confined to biomedical model organisms and required the concerted effort of large consortia. Rapid progress in high‐throughput sequencing technology and the simultaneous development of bioinformatic tools have democratized the field. It is now within reach for individual research groups in the eco‐evolutionary and conservation community to generate de novo draft genome sequences for any organism of choice. Because of the cost and considerable effort involved in such an endeavour, the important first step is to thoroughly consider whether a genome sequence is necessary for addressing the biological question at hand. Once this decision is taken, a genome project requires careful planning with respect to the organism involved and the intended quality of the genome draft. Here, we briefly review the state of the art within this field and provide a step‐by‐step introduction to the workflow involved in genome sequencing, assembly and annotation, with particular reference to large and complex genomes. This tutorial is targeted at scientists with a background in conservation genetics but, more generally, provides practical guidance for researchers engaging in whole‐genome sequencing projects.
