A Bioinformatics Procedure to Identify and Annotate Somatic Mutations in Whole-Exome Sequencing Data

Next-generation sequencing instruments generate a tremendous amount of data. This creates a challenging bioinformatics problem: storing, managing and analyzing terabytes of sequencing data, often produced by highly heterogeneous data sources. Our project focuses on the sequence analysis of human cancer genomes, with the aim of identifying the genetic lesions underlying the development of tumors. However, both the automated detection of somatic mutations and a statistically based testing procedure to identify genetic lesions remain open problems. We therefore propose a computational procedure to manage large-scale sequencing data and detect exonic somatic mutations in a tumor sample. The proposed pipeline consists of several steps based on open-source software and the R language: alignment, detection of mutations, annotation, functional classification and visualization of results. We analyzed whole-exome sequencing data from 3 leukemic patients with 3 paired controls, plus 1 colon cancer sample with its paired control. The results were validated by Sanger sequencing.
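The core idea behind detecting somatic mutations with a paired control can be illustrated with a minimal sketch: a variant called in the tumor sample is kept as a somatic candidate only if it is absent from the matched control. This is not the pipeline's actual implementation. The `Variant` type, the `somatic_candidates` function and the example calls are hypothetical names and data introduced purely for illustration, and a real procedure would also account for coverage, base quality and allele frequency.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a format used by the paper)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_candidates(tumor: Iterable[Variant],
                       control: Iterable[Variant]) -> List[Variant]:
    """Return variants called in the tumor but absent from the paired control."""
    control_set = set(control)
    # Deduplicate tumor calls, drop those also seen in the control,
    # and sort by (chrom, pos, ref, alt) for a stable output order.
    return sorted(v for v in set(tumor) if v not in control_set)


# Illustrative, made-up calls: the chr2 variant is shared with the control,
# so it is treated as germline and filtered out.
tumor_calls = [Variant("chr1", 100, "A", "T"), Variant("chr2", 200, "G", "C")]
control_calls = [Variant("chr2", 200, "G", "C")]
somatic = somatic_candidates(tumor_calls, control_calls)
```

Using a set for the control calls keeps the comparison linear in the number of variants, which matters at exome scale.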
