Unraveling genomic variation from next generation sequencing data

Elucidating the content of a DNA sequence is critical for understanding and decoding the genetic information of any biological system. As next generation sequencing (NGS) techniques have become cheaper and higher in throughput over time, they have driven major innovations and breakthrough findings in various biological areas. Some of the areas shaped by these technological advances are the evolution of species, microbial mapping, population genetics, genome-wide association studies (GWAS), comparative genomics, variant analysis, gene expression, gene regulation, epigenetics and personalized medicine. While NGS techniques are key players in modern biological research, analyzing and interpreting the vast amount of data they produce is not a trivial task and remains a great challenge in the field of bioinformatics. Efficient tools that cope with information overload, tackle the high complexity and provide meaningful visualizations to ease knowledge extraction are therefore essential. In this article, we briefly describe the sequencing methodologies and the equipment available for these analyses, as well as the data formats of the files they produce. We then give a thorough review of tools developed to efficiently store, analyze and visualize such data, with emphasis on structural variation analysis and comparative genomics. Finally, we comment on their functionality, strengths and weaknesses, and we discuss how future applications could further develop in this field.
