Paper information - Efficient genotype compression and analysis of large genetic variation datasets

Genotype Query Tools (GQT) is an indexing strategy that expedites analyses of genome-variation data sets in Variant Call Format based on sample genotypes, phenotypes and relationships. GQT's compressed genotype index minimizes decompression for analysis, and its performance relative to that of existing methods improves with cohort size. We show substantial (up to 443-fold) gains in performance over existing methods and demonstrate GQT's utility for exploring massive data sets involving thousands to millions of genomes. GQT can be accessed at https://github.com/ryanlayer/gqt.
