Haplotype-aware graph indexes

Abstract

Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.

Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Availability and implementation: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2.

Supplementary information: Supplementary data are available at Bioinformatics online.
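To make the motivation concrete, the following minimal sketch compares the k-mers spelled by every path through a toy graph with the k-mers spelled by a small set of haplotype paths. The graph, the haplotypes, and all function names are hypothetical and chosen only for illustration. This is not the GBWT or the graph simplification algorithm described in the abstract. It only illustrates why restricting attention to observed haplotypes excludes recombinant k-mers while keeping every haplotype k-mer.

```python
from itertools import product

# Hypothetical toy graph: a chain of sites, each offering one or more alleles.
# Sites with two alleles act as simple variant "bubbles".
SITES = [["GAT"], ["A", "C"], ["TTG"], ["G", "T"], ["CA"]]

# Hypothetical haplotypes: the allele index chosen at each site.
HAPLOTYPES = [(0, 0, 0, 0, 0), (0, 1, 0, 1, 0)]


def spell(choice):
    """Concatenate the alleles selected by `choice` into one sequence."""
    return "".join(site[i] for site, i in zip(SITES, choice))


def kmers(seq, k):
    """Return the set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def all_path_kmers(k):
    """k-mers over every path in the graph, including unobserved recombinations."""
    result = set()
    for choice in product(*(range(len(site)) for site in SITES)):
        result |= kmers(spell(choice), k)
    return result


def haplotype_kmers(k):
    """k-mers over the observed haplotype paths only."""
    result = set()
    for choice in HAPLOTYPES:
        result |= kmers(spell(choice), k)
    return result


if __name__ == "__main__":
    k = 6
    everything = all_path_kmers(k)
    observed = haplotype_kmers(k)
    print(len(observed), "haplotype k-mers out of", len(everything), "graph k-mers")
    print("recombinant-only k-mers:", sorted(everything - observed))
```

In this toy setting, every haplotype k-mer is also a graph k-mer, but k-mers that span both variant sites with an unobserved allele combination appear only in the all-paths set. Enumerating all paths is exponential in the number of variant sites, so the sketch is only usable on very small examples.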
