Challenges, Solutions, and Quality Metrics of Personal Genome Assembly in Advancing Precision Medicine

Although any two people share more than 99% of their genomic DNA sequence, millions of small regions differ in sequence or structure between individuals, giving us different physical characteristics and different responses to medical treatment. Currently, genetic variants in diseased tissues, such as tumors, are identified by comparing the sequences detected in the diseased tissue against the reference genome. However, the public reference genome was derived from the DNA of multiple individuals. As a result, it is incomplete and may misrepresent the sequence variants of the general population. A more reliable approach is to compare the sequences of diseased tissue with the same individual's genome sequence derived from normal tissue. The price of sequencing a human genome has dropped dramatically to around $1000, which makes documenting a personal genome for every individual a realistic prospect. However, de novo assembly of individual genomes at an affordable cost remains challenging, and to date only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging “third generation sequencing” technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructure. We suggest that hybrid assembly combining long and short reads is a promising way to generate high-quality human genome assemblies, and we specify parameters for assessing the quality of assembly outcomes. We offer a perspective on the benefits of using personal genomes as references, along with suggestions for obtaining a high-quality personal genome. Finally, we discuss the use of the personal genome in aiding vaccine design and development, monitoring host immune response, tailoring drug therapy, and detecting tumors. We believe that precision medicine would benefit greatly from bioinformatics solutions, particularly for personal genome assembly.
