Haplotype-aware graph indexes

Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are nonbiological, unlikely recombinations of true haplotypes.

Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform (GBWT). We demonstrate the scalability of the new implementation by building a whole-genome index of the 5,008 haplotypes of the 1000 Genomes Project, and an index of all 108,070 TOPMed Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Availability: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt, and https://github.com/jltsiren/gcsa2.

Contact: jouni.siren@iki.fi

Supplementary information: Supplementary data are available.
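The distinction between graph paths and haplotype paths can be illustrated with a toy example. The sketch below is not the GBWT and does not use the vg or gbwt APIs; the graph, the haplotypes, and the helper names (all_paths, haplotype_support) are hypothetical, and the lookup is a naive linear scan standing in for what a compressed haplotype index would answer.

```python
# Toy variation graph (hypothetical): two biallelic sites, each forming a "bubble".
# Keys are node ids, values are the successor nodes.
edges = {
    1: [2, 3],
    2: [4],
    3: [4],
    4: [5, 6],
    5: [7],
    6: [7],
    7: [],
}

# Hypothetical observed haplotypes, each stored as a path of node ids.
haplotypes = [
    [1, 2, 4, 5, 7],
    [1, 3, 4, 6, 7],
]


def all_paths(node, sink):
    """Enumerate every path from node to sink in the acyclic toy graph."""
    if node == sink:
        return [[node]]
    return [[node] + rest
            for nxt in edges[node]
            for rest in all_paths(nxt, sink)]


def haplotype_support(subpath):
    """Count haplotypes that contain subpath as a contiguous run of nodes.

    Naive scan for illustration only; a haplotype index would answer
    the same kind of query without scanning every haplotype.
    """
    n = len(subpath)
    return sum(
        any(h[i:i + n] == subpath for i in range(len(h) - n + 1))
        for h in haplotypes
    )


graph_paths = all_paths(1, 7)
supported = [p for p in graph_paths if haplotype_support(p) > 0]

print(len(graph_paths), "paths in the graph")
print(len(supported), "of them match an observed haplotype")
print("support for [2, 4, 6]:", haplotype_support([2, 4, 6]))
```

In this toy graph, the path through nodes 2, 4, and 6 exists in the graph but recombines alleles that no observed haplotype carries together, so its support is zero.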
