The simple fool's guide to population genomics via RNA‐Seq: an introduction to high‐throughput sequencing data analysis

High‐throughput sequencing technologies are currently revolutionizing the fields of biology and medicine, yet bioinformatic challenges in analysing very large data sets have slowed the adoption of these technologies by the community of population biologists. We introduce the ‘Simple Fool's Guide to Population Genomics via RNA‐Seq’ (SFG), a document intended to serve as an easy‐to‐follow protocol, walking a user through one example of high‐throughput sequencing data analysis of nonmodel organisms. It is by no means an exhaustive protocol, but rather serves as an introduction to the bioinformatic methods used in population genomics, enabling a user to gain familiarity with basic analysis steps. The SFG consists of two parts. The first is this document, which summarizes the steps needed, lays out the basic themes for each and describes a simple approach to follow. The second is the full SFG, publicly available at http://sfg.stanford.edu, which includes detailed protocols for data processing and analysis, along with a repository of custom‐made scripts and sample files. Steps included in the SFG range from tissue collection to de novo assembly, BLAST annotation, alignment, gene expression, functional enrichment, SNP detection, principal components and FST outlier analyses. Although the technical aspects of population genomics are changing very quickly, our hope is that this document will help population biologists with little to no background in high‐throughput sequencing and bioinformatics to adopt these new techniques more quickly.
