Aquila: diploid personal genome assembly and comprehensive variant detection based on linked reads

Variant discovery in personal whole-genome sequence data is critical for uncovering the genetic contributions to health and disease. We introduce a new approach, Aquila, that uses linked-read data to generate a high-quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. Assemblies cover >95% of the human reference genome, with over 98% in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single-nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective evolution of whole-genome reconstruction that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.
