Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework

Abstract: Whole genome sequencing (WGS) is a promising strategy for uncovering the variants or genes responsible for human diseases and traits. However, robust platforms for comprehensive downstream analysis are lacking. In the present study, we first propose three novel algorithms, each addressing a fundamental problem: sequence gap-filled gene feature annotation, bit-block encoded genotypes, and sectional fast access to text lines. The three algorithms form the infrastructure of KGGSeq, a robust parallel computing framework that integrates downstream analysis functions for WGS data. KGGSeq is equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In tests with WGS data from the 1000 Genomes Project, KGGSeq annotated several thousand more reliable non-synonymous variants than other widely used tools (e.g. ANNOVAR and SnpEff). It took only around half an hour on a small server with 10 CPUs to access the genotypes of ∼60 million variants in 2504 subjects, whereas a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles, and it calculated genotypic correlation over 1000 times faster.
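To illustrate why a bit-level genotype encoding can speed up genotypic correlation, the following is a minimal sketch and not KGGSeq's actual bit-block format, whose layout is not described in this abstract. It assumes unphased biallelic genotypes coded as alternate-allele counts (0, 1 or 2) with no missing values, and stores each variant as two bit planes held in Python integers. The function names `pack`, `popcount` and `correlation` are hypothetical.

```python
from math import sqrt

def pack(genotypes):
    """Pack allele counts (0/1/2) into two bit planes, one bit per subject.

    Assumption: this two-plane layout is illustrative only; it is not
    the bit-block format used by KGGSeq.
    """
    lo = hi = 0
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1 or 2")
        lo |= (g & 1) << i   # low bit of the allele count
        hi |= (g >> 1) << i  # high bit of the allele count
    return lo, hi

def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def correlation(a, b, n):
    """Pearson correlation of two packed variants over n subjects."""
    (alo, ahi), (blo, bhi) = a, b
    # Since g = lo + 2*hi, sums reduce to weighted bit counts.
    sa = popcount(alo) + 2 * popcount(ahi)
    sb = popcount(blo) + 2 * popcount(bhi)
    saa = popcount(alo) + 4 * popcount(ahi)
    sbb = popcount(blo) + 4 * popcount(bhi)
    # Cross products come from bitwise ANDs of the planes.
    sab = (popcount(alo & blo) + 2 * popcount(alo & bhi)
           + 2 * popcount(ahi & blo) + 4 * popcount(ahi & bhi))
    var_a = n * saa - sa * sa
    var_b = n * sbb - sb * sb
    if var_a == 0 or var_b == 0:
        return float("nan")  # a monomorphic variant has no defined correlation
    return (n * sab - sa * sb) / sqrt(var_a * var_b)

x = [0, 1, 2, 1, 0, 2]
y = [2, 1, 0, 1, 2, 0]
r_same = correlation(pack(x), pack(x), len(x))
r_opposite = correlation(pack(x), pack(y), len(x))
```

In this sketch every sum needed for the correlation comes from a handful of bitwise AND and bit-count operations over whole blocks of subjects, rather than a per-subject loop over decoded genotypes.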
