“Escalibur”—A practical pipeline for the de novo analysis of nucleotide variation in nonmodel eukaryotes

The revolution in genomics has enabled large-scale population genetic investigations of a wide range of organisms, but relatively little attention has been paid to improving analytical pipelines. Analysing large data sets efficiently requires highly integrated and automated software pipelines that are easy to use, efficient, reliable and reproducible, and that run in multiple computational environments. A number of software workflows have been developed to handle and process such data sets for population genetic analyses, but effective, specialized pipelines for genetic and statistical analyses of nonmodel organisms are lacking. For most species, variome resources (sets of genetic variations found in populations of a species) are not available, and/or genome assemblies are often incomplete and fragmented, which complicates the selection of the most suitable reference genome when multiple assemblies are available. Additionally, biological samples often contain extraneous DNA from sources other than the species under investigation (e.g., microbial contamination), which needs to be removed prior to genetic analyses. For these reasons, we established a new pipeline, called Escalibur, which includes functionalities such as data trimming and mapping, selection of a suitable reference genome, removal of contaminating read data, recalibration of base calls, and variant calling. Escalibur uses the proven GATK variant caller and the Workflow Description Language (WDL), and is therefore a highly efficient and scalable pipeline for the genome-wide identification of nucleotide variation in eukaryotes. The pipeline is available at https://gitlab.unimelb.edu.au/bioscience/escalibur (version 0.3-beta) and is essentially applicable to any prokaryote or eukaryote.
