VariantStore: A Large-Scale Genomic Variant Search Index

The ability to efficiently query genomic variants from thousands of samples is critical to achieving the full potential of many medical and scientific applications, such as personalized medicine. At the core of these applications are variant queries based on coordinates in the reference or sample sequences, and supporting such queries efficiently across thousands of samples is computationally challenging. Most existing solutions support only queries based on reference coordinates, and those that support queries based on coordinates across multiple samples do not scale to data containing more than a few thousand samples. We present VariantStore, a system for efficiently indexing and querying genomic variants and their sequences in either the reference or sample-specific coordinate systems. We show the scalability of VariantStore by indexing genomic variants from the TCGA-BRCA project, containing 8640 samples and 5M variants, in 4 hours and from the 1000 Genomes Project, containing 2500 samples and 924M variants, in 3 hours. Querying for variants in a gene takes between 0.002 and 3 seconds while using memory equal to only 10% of the size of the full representation.
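
To make the distinction between reference and sample-specific coordinates concrete, the following is a minimal sketch of how a reference position shifts once a sample's insertions and deletions are applied. It is an illustration only, not VariantStore's index or algorithm; the `Variant` class and the `ref_to_sample_pos` function are hypothetical names introduced here, and the sketch assumes 0-based positions, VCF-style alleles, and non-overlapping variants.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one variant carried by a sample."""
    ref_pos: int  # 0-based start of the variant in the reference
    ref: str      # reference allele
    alt: str      # allele observed in the sample


def ref_to_sample_pos(ref_pos: int, variants: Iterable[Variant]) -> int:
    """Translate a reference position into the sample's coordinate system.

    Assumes the variants do not overlap and that ref_pos does not fall
    inside a reference allele that the sample replaces.
    """
    shift = 0
    for v in sorted(variants, key=lambda v: v.ref_pos):
        ref_end = v.ref_pos + len(v.ref)
        if ref_end > ref_pos:
            break  # this variant and all later ones lie at or after ref_pos
        # Each upstream variant moves downstream bases by its length change.
        shift += len(v.alt) - len(v.ref)
    return ref_pos + shift


sample_variants = [
    Variant(ref_pos=2, ref="A", alt="ATT"),   # insertion of 2 bases
    Variant(ref_pos=10, ref="GCC", alt="G"),  # deletion of 2 bases
]
after_insertion = ref_to_sample_pos(5, sample_variants)
after_deletion = ref_to_sample_pos(20, sample_variants)
```

A linear scan like this costs time proportional to the number of variants per query, which is why an index is needed for the sample counts and variant counts reported above.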
