VarGenius executes cohort-level DNA-seq variant calling and annotation and allows to manage the resulting data through a PostgreSQL database

Background: Targeted resequencing has become the most widely used and cost-effective approach for identifying causative mutations of Mendelian diseases, for both diagnostic and research purposes. Due to very rapid technological progress, NGS laboratories are expanding their capabilities to address the increasing number of analyses. Several open source tools are available to build a generic variant calling pipeline, but a tool able to simultaneously execute multiple analyses and to organize and categorize the samples is still missing.

Results: Here we describe VarGenius, a Linux-based command-line software able to execute customizable pipelines for the analysis of multiple targeted resequencing datasets using parallel computing. VarGenius provides a database to store the output of the analysis (calling quality statistics, variant annotations, internal allelic variant frequencies) and sample information (personal data, genotypes, phenotypes). VarGenius can also perform the "joint analysis" of hundreds of samples with a single command, drastically reducing the time needed to configure and execute the analysis.

VarGenius executes the standard pipeline of the Genome Analysis Toolkit (GATK) best practices (GBP) for germline variant calling, annotates the variants using Annovar, and generates a user-friendly output that displays the results through a web page.

VarGenius has been tested on a parallel computing cluster of 52 machines with 120 GB of RAM each. Under this configuration, a 50 M whole exome sequencing (WES) analysis for a family (trio or quartet) was executed in about 7 h, a joint analysis of 30 WES in about 24 h, and the parallel analysis of 34 single samples from a 1 M panel in about 2 h.

Conclusions: We developed VarGenius, a "master" tool that addresses the increasing demand for heterogeneous NGS analyses and allows maximum flexibility for downstream analyses. It paves the way to a different kind of analysis, centered on cohorts rather than on singletons. Patient and variant information is stored in the database, and any output file can be accessed programmatically. VarGenius can be used for routine analyses by biomedical researchers with basic Linux skills, while providing additional flexibility for computational biologists to develop their own algorithms for the comparison and analysis of data.

The software is freely available at: https://github.com/frankMusacchia/VarGenius
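
The abstract mentions that internal allelic variant frequencies are stored alongside the genotypes of each sample. As a rough illustration of what such a cohort-level frequency means, the sketch below computes the alternate-allele frequency of one variant from VCF-style genotype strings. It is not VarGenius code: the function name `internal_allele_frequency`, the sample names, and the input layout are hypothetical and chosen only for this example.

```python
def internal_allele_frequency(genotypes):
    """Return the ALT allele frequency of one variant across a cohort.

    `genotypes` maps a sample name to a VCF-style genotype string such as
    "0/0", "0/1", "1|1" or "./." (missing). This layout is an assumption
    made for illustration, not the VarGenius database schema.
    """
    alt_alleles = 0
    called_alleles = 0
    for genotype in genotypes.values():
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing calls do not count toward the denominator
            called_alleles += 1
            if allele != "0":
                alt_alleles += 1
    if called_alleles == 0:
        return None  # no sample was genotyped at this site
    return alt_alleles / called_alleles


# Hypothetical quartet: one heterozygote, one homozygous ALT, one missing call.
cohort = {"proband": "0/1", "mother": "0/0", "father": "1|1", "sibling": "./."}
frequency = internal_allele_frequency(cohort)
```

Excluding missing calls from the denominator keeps the estimate from being pulled toward zero in cohorts where coverage differs between samples; whether VarGenius makes the same choice is not stated in the abstract.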

Michele Pinelli | Margherita Mutarelli | Francesco Musacchia | Alessandro Bruselles | Swaraj Basu | A. Ciolfi | R. Castello | S. Banfi | G. Casari | M. Tartaglia | V. Nigro | N. Brunetti‐Pierri | A. Selicorni | S. Maitz | G. Parenti | G. Cappuccio | A. Torella | G. Mancano | G. Esposito
