SeqTailor: a user-friendly webserver for the extraction of DNA or protein sequences from next-generation sequencing data

Human whole-genome sequencing typically reveals about 4,000,000 genetic variants, including about 20,000 coding variants, in each individual studied. These data are mostly stored as VCF-format files. Although many variant analysis methods accept VCF files as input, many other tools require DNA or protein sequences, particularly those for splicing prediction, sequence alignment, phylogenetic analysis, and structure prediction. However, no online tool currently extracts DNA or protein sequences for genomic variants from VCF files with user-defined parameters in a user-friendly, efficient, and standardized manner. We developed the SeqTailor webserver to bridge this gap. It can be used for the rapid extraction of (1) DNA sequences around genetic variants, with customizable window sizes, from the hg19 or hg38 human reference genomes; and (2) protein sequences encoded by the DNA sequences around genetic variants, with built-in SnpEff annotation and customizable window sizes, from human canonical transcripts. The SeqTailor webserver streamlines sequence extraction and accelerates the analysis of genetic variant data with software that requires DNA or protein sequences. By making sequence-based analysis and prediction more feasible, SeqTailor will facilitate the study of human genomic variation. The SeqTailor webserver is freely available at http://shiva.rockefeller.edu/SeqTailor/.
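The idea of extracting a DNA sequence around a variant with a customizable window can be illustrated with a minimal Python sketch. It is not SeqTailor's implementation: the function name `extract_window`, its parameters, and the toy reference string are hypothetical, and it assumes only that the variant follows the VCF convention of a 1-based position with REF and ALT alleles on the forward strand.

```python
def extract_window(reference: str, pos: int, ref: str, alt: str, window: int):
    """Return (reference_sequence, variant_sequence) around one variant.

    reference: the chromosome sequence as a plain string (hypothetical input)
    pos:       1-based position of the first REF base, as in a VCF record
    ref, alt:  REF and ALT alleles from the VCF record
    window:    number of flanking bases to keep on each side
    """
    start = pos - 1          # convert to a 0-based index
    end = start + len(ref)   # index just past the last REF base

    # Guard against a mismatch between the VCF record and the reference.
    if reference[start:end].upper() != ref.upper():
        raise ValueError("REF allele does not match the reference sequence")

    # Flanks are clipped at the sequence boundaries.
    left = reference[max(0, start - window):start]
    right = reference[end:end + window]

    return left + ref + right, left + alt + right


# Toy example: a single-nucleotide variant A>G at position 5, window of 3.
ref_seq, alt_seq = extract_window("ACGTACGTAC", pos=5, ref="A", alt="G", window=3)
print(ref_seq)
print(alt_seq)
```

The same function handles short insertions and deletions, because the flanks are taken on either side of the full REF allele rather than a single base. Translating the extracted DNA into protein would additionally require transcript annotation, which this sketch does not attempt.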
