iWhale: a computational pipeline based on Docker and SCons for detection and annotation of somatic variants in cancer WES data

Whole exome sequencing (WES) is a powerful approach for discovering sequence variants in cancer cells, but its time effectiveness is limited by the complexity of WES data analysis. Here we present iWhale, a customizable pipeline based on Docker and SCons that reliably detects somatic variants with three complementary callers (MuTect2, Strelka2 and VarScan2). The results are combined into a single variant call format (VCF) file for each sample, and variants are annotated by integrating a wide range of information extracted from several reference databases, ultimately allowing variant and gene prioritization according to different criteria. iWhale lets users conduct a complex series of WES analyses with a powerful yet customizable and easy-to-use tool that runs on most operating systems (macOS, GNU/Linux and Windows). The iWhale code is freely available at https://github.com/alexcoppe/iWhale and the Docker image can be downloaded from https://hub.docker.com/r/alexcoppe/iwhale.
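
To illustrate the idea of combining the output of several callers into one call set per sample, here is a minimal Python sketch. It is not iWhale's actual merging code: the function names (`parse_vcf_records`, `combine_calls`), the union-with-support strategy and the toy records are all assumptions made for illustration, and only the first five tab-separated VCF columns are read.

```python
from collections import defaultdict


def parse_vcf_records(vcf_text):
    """Yield (chrom, pos, ref, alt) keys from VCF text, skipping header lines."""
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or VCF header/meta line
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a usable variant record
        chrom, pos, _id, ref, alt = fields[:5]
        # Multi-allelic sites list several ALT alleles separated by commas.
        for allele in alt.split(","):
            yield (chrom, int(pos), ref, allele)


def combine_calls(calls_by_caller):
    """Map each variant key to the sorted list of callers that reported it.

    Hypothetical helper: takes the union of all calls and records caller support.
    """
    support = defaultdict(set)
    for caller, vcf_text in calls_by_caller.items():
        for key in parse_vcf_records(vcf_text):
            support[key].add(caller)
    return {key: sorted(callers) for key, callers in sorted(support.items())}


# Toy inputs standing in for the three callers' per-sample output (invented data).
mutect2 = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tT\nchr2\t200\t.\tG\tC\n"
strelka2 = "chr1\t100\t.\tA\tT\n"
varscan2 = "chr1\t100\t.\tA\tT\nchr3\t300\t.\tC\tG,T\n"

combined = combine_calls(
    {"MuTect2": mutect2, "Strelka2": strelka2, "VarScan2": varscan2}
)

for key, callers in combined.items():
    print(key, callers)
```

Keeping the list of supporting callers alongside each variant is one possible design choice: it would let a later step prioritize variants by how many callers agree, which is in the spirit of the prioritization described above but is not taken from the iWhale source.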
