Indel variant analysis of short-read sequencing data with Scalpel

As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but the discovery of indels of more than a few bases in size from short-read sequencing data remains challenging. Scalpel (http://scalpel.sourceforge.net) is an open-source software for reliable indel detection based on the microassembly technique. It has been successfully used to discover mutations in novel candidate genes for autism, and it is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole-genome and whole-exome sequencing data. We provide detailed instructions for an exemplary family-based de novo study, but we also characterize the other two supported modes of operation: single-sample and somatic analysis. Indel normalization, visualization and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be completed in ∼5 h after read mapping.

[1]  Radhey S. Gupta Protein Phylogenies and Signature Sequences: A Reappraisal of Evolutionary Relationships among Archaebacteria, Eubacteria, and Eukaryotes , 1998, Microbiology and Molecular Biology Reviews.

[2]  Competing financial interests , 2001, Nature Immunology.

[3]  Dee R. Denver,et al.  High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome , 2004, Nature.

[4]  R. Hudson,et al.  Single-nucleotide mutation rate increases close to insertions/deletions in eukaryotes , 2008, Nature.

[5]  Norikuni Saka,et al.  Loss of Function of a Proline-Containing Protein Confers Durable Disease Resistance in Rice , 2009, Science.

[6]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[7]  Kai Ye,et al.  Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads , 2009, Bioinform..

[8]  A. Gnirke,et al.  High-quality draft assemblies of mammalian genomes from massively parallel sequence data , 2010, Proceedings of the National Academy of Sciences.

[9]  Ryan E. Mills,et al.  Small insertions and deletions (INDELs) in human genomes. , 2010, Human molecular genetics.

[10]  Aaron R. Quinlan,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2022 .

[11]  H. Hakonarson,et al.  ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data , 2010, Nucleic acids research.

[12]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[13]  Daniel Rios,et al.  Bioinformatics Applications Note Databases and Ontologies Deriving the Consequences of Genomic Variants with the Ensembl Api and Snp Effect Predictor , 2022 .

[14]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[15]  Helga Thorvaldsdóttir,et al.  Integrative Genomics Viewer , 2011, Nature Biotechnology.

[16]  R. Durbin,et al.  Dindel: accurate indel calls from short-read data. , 2011, Genome research.

[17]  Kenny Q. Ye,et al.  De Novo Gene Disruptions in Children on the Autistic Spectrum , 2012, Neuron.

[18]  M. Rieder,et al.  Detection of structural variants and indels within exome data , 2011, Nature Methods.

[19]  S. Rosset,et al.  lobSTR: A short tandem repeat profiler for personal genomes , 2012, RECOMB.

[20]  G. McVean,et al.  De novo assembly and genotyping of variants using colored de Bruijn graphs , 2011, Nature Genetics.

[21]  Joseph K. Pickrell,et al.  A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes , 2012, Science.

[22]  Pablo Cingolani,et al.  © 2012 Landes Bioscience. Do not distribute. , 2022 .

[23]  A. Børresen-Dale,et al.  Mutational Processes Molding the Genomes of 21 Breast Cancers , 2012, Cell.

[24]  Zamin Iqbal,et al.  Identifying and Classifying Trait Linked Polymorphisms in Non-Reference Species by Walking Coloured de Bruijn Graphs , 2013, PloS one.

[25]  Aaron R. Quinlan,et al.  GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations , 2013, PLoS Comput. Biol..

[26]  Mauricio O. Carneiro,et al.  From FastQ Data to High‐Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline , 2013, Current protocols in bioinformatics.

[27]  Heng Li Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[28]  G. Highnam,et al.  Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles , 2012, Nucleic acids research.

[29]  Yingrui Li,et al.  SOAPindel: Efficient identification of indels from short paired reads , 2013, Genome research.

[30]  Murim Choi,et al.  De novo mutations in histone modifying genes in congenital heart disease , 2013, Nature.

[31]  Mark Gerstein,et al.  The origin, evolution, and functional impact of short insertion–deletion variants identified in 179 human genomes , 2013, Genome research.

[32]  Jean-Baptiste Cazier,et al.  Choice of transcripts and software has a large effect on variant annotation , 2014, Genome Medicine.

[33]  C. Nusbaum,et al.  Comprehensive variation discovery in single human genomes , 2014, Nature Genetics.

[34]  Matthew D. Wilkerson,et al.  ABRA: improved coding indel detection via assembly-based realignment , 2014, Bioinform..

[35]  G. Weinstock,et al.  TIGRA: A targeted iterative graph routing assembler for breakpoint assembly , 2014, Genome research.

[36]  Boris Yamrom,et al.  The contribution of de novo coding mutations to autism spectrum disorder , 2014, Nature.

[37]  G. McVean,et al.  Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications , 2014, Nature Genetics.

[38]  Vladimir Vacic,et al.  Comparative sequencing analysis reveals high genomic concordance between matched primary and metastatic colorectal cancer lesions , 2014, Genome Biology.

[39]  M. Schatz,et al.  Reducing INDEL calling errors in whole genome and exome sequencing data , 2014, Genome Medicine.

[40]  Michael R. Speicher,et al.  A survey of tools for variant analysis of next-generation genome sequencing data , 2013, Briefings Bioinform..

[41]  Andrei L. Turinsky,et al.  The missing indels: an estimate of indel variation in a human genome and analysis of factors that impede detection , 2015, Nucleic acids research.

[42]  J. Zook,et al.  An analytical framework for optimizing variant discovery from personal genomes , 2015, Nature Communications.

[43]  J. Landolin,et al.  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing , 2014, Nature Biotechnology.

[44]  Gonçalo R. Abecasis,et al.  Unified representation of genetic variants , 2015, Bioinform..

[45]  Rendong Yang,et al.  ScanIndel: a hybrid framework for indel detection via gapped alignment, split reads and de novo assembly , 2015, Genome Medicine.

[46]  Michael C. Schatz,et al.  The Challenge of Small-Scale Repeats for Indel Discovery , 2015, Front. Bioeng. Biotechnol..

[47]  F. Collins,et al.  A new initiative on precision medicine. , 2015, The New England journal of medicine.

[48]  Xiaoyu Chen,et al.  Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications , 2016, Bioinform..

[49]  G. McVean,et al.  A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree , 2016, bioRxiv.