Sources of erroneous sequences and artifact chimeric reads in next generation sequencing of genomic DNA from formalin-fixed paraffin-embedded samples

Abstract: Tissues used in pathology laboratories are typically stored in the form of formalin-fixed, paraffin-embedded (FFPE) samples. One important consideration in repurposing FFPE material for next generation sequencing (NGS) analysis is that sequencing artifacts can arise from the significant damage to nucleic acids caused by treatment with formalin, storage at room temperature, and extraction. One such class of artifacts consists of chimeric reads that appear to be derived from non-contiguous portions of the genome. Here, we show that a major proportion of such chimeric reads align to both the ‘Watson’ and ‘Crick’ strands of the reference genome. We refer to these as strand-split artifact reads (SSARs). This study provides a conceptual framework for the mechanistic basis of the genesis of SSARs and other chimeric artifacts, along with supporting experimental evidence, which together have led to approaches to reduce the levels of such artifacts. We demonstrate that one of these approaches, involving S1 nuclease-mediated removal of single-stranded fragments and overhangs, also reduces sequence bias, base error rates, and false positive detection of copy number and single nucleotide variants. Finally, we describe an analytical approach for quantifying SSARs from NGS data.
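To make the idea of a read that aligns to both strands concrete, the following is a minimal Python sketch of how such reads might be counted from SAM-format alignments. It is not the analytical approach described in the paper, whose details are not given here. It assumes an aligner that reports split alignments through the standard `SA` tag, and it treats a primary alignment as strand-split when any supplementary segment maps to the same reference sequence on the opposite strand. The function name `count_strand_split` and the example records are hypothetical.

```python
def count_strand_split(sam_lines):
    """Return (strand_split_reads, primary_mapped_reads) for SAM text lines.

    Assumption: a read counts as strand-split when its SA tag lists a
    segment on the same reference sequence but the opposite strand.
    """
    split = 0
    total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100), supplementary (0x800) records.
        if flag & (0x4 | 0x100 | 0x800):
            continue
        total += 1
        rname = fields[2]
        strand = "-" if flag & 0x10 else "+"
        for tag in fields[11:]:
            if not tag.startswith("SA:Z:"):
                continue
            # SA entries: rname,pos,strand,CIGAR,mapQ,NM; separated by ';'
            entries = [e for e in tag[5:].split(";") if e]
            if any(e.split(",")[0] == rname and e.split(",")[2] != strand
                   for e in entries):
                split += 1
            break
    return split, total


# Hypothetical records, for illustration only.
EXAMPLE = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t50M50S\t*\t0\t0\t*\t*\tSA:Z:chr1,150,-,50S50M,60,0;",
    "r2\t0\tchr1\t300\t60\t100M\t*\t0\t0\t*\t*",
    "r3\t16\tchr1\t400\t60\t50M50S\t*\t0\t0\t*\t*\tSA:Z:chr1,500,-,50S50M,60,0;",
    "r1\t2048\tchr1\t150\t60\t50H50M\t*\t0\t0\t*\t*",
]

split_reads, mapped_reads = count_strand_split(EXAMPLE)
print(split_reads, mapped_reads)
```

In practice, a distance threshold between the two segments and a mapping-quality filter would likely be needed to separate artifacts of this kind from genuine inversions; both are omitted here to keep the sketch short.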
