MutScan: fast detection and visualization of target mutations by scanning FASTQ data

Background: Some types of clinical genetic tests, such as cancer testing using circulating tumor DNA (ctDNA), require sensitive detection of known target mutations. However, conventional next-generation sequencing (NGS) data analysis pipelines typically involve several filtering steps, which may cause key low-frequency mutations to be missed. Key mutations detected by bioinformatics pipelines also need to be validated. This is typically done with alignment visualization tools such as IGV or GenomeBrowse, but these tools are too heavyweight to be suitable for validating mutations in ultra-deep sequencing data.

Results: We developed MutScan to address the problems of sensitive detection and efficient validation of target mutations. MutScan uses highly optimized string-searching algorithms to scan input FASTQ files and collect all reads that support the target mutations. The supporting reads collected for each target mutation are piled up and visualized using web technologies such as HTML and JavaScript. Algorithms such as rolling hash and Bloom filter are applied to accelerate scanning, so that MutScan can detect and visualize target mutations quickly.

Conclusions: MutScan is a tool for detecting and visualizing target mutations by scanning raw FASTQ data directly. Compared to conventional pipelines, it runs about 20 times faster and offers maximal sensitivity, since it can capture a mutation supported by as little as a single read. MutScan visualizes detected mutations by generating interactive pile-ups using web technologies. These pile-ups can be used to validate target mutations and thus avoid false positives. Furthermore, MutScan can render all mutation records in a VCF file as HTML pages for cloud-friendly VCF validation. MutScan is an open-source tool available on GitHub: https://github.com/OpenGene/MutScan
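The idea of scanning reads with a rolling hash can be illustrated with a minimal Python sketch. This is not MutScan's implementation: the function names (`kmer_hash`, `find_pattern`, `supporting_reads`), the base-4 encoding, the modulus, and the exact-match criterion are all assumptions made for illustration, and the Bloom filter mentioned in the abstract is omitted for brevity. The sketch hashes a target sequence spanning a mutation, slides a window of the same length over each read while updating the hash in constant time, and confirms every hash match with a direct string comparison.

```python
from typing import Iterable, Iterator, List, Tuple

BASE = 4                    # one digit per nucleotide (assumed encoding)
MOD = (1 << 61) - 1         # large prime modulus (assumed choice)
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(base: str) -> int:
    # Unknown bases such as N are mapped to 0; the string comparison in
    # find_pattern rejects any false hash match this may cause.
    return CODE.get(base, 0)


def kmer_hash(seq: str) -> int:
    """Polynomial hash of a whole sequence."""
    h = 0
    for ch in seq:
        h = (h * BASE + encode(ch)) % MOD
    return h


def find_pattern(read: str, pattern: str) -> List[int]:
    """Return all start positions where pattern occurs exactly in read."""
    k = len(pattern)
    if k == 0 or len(read) < k:
        return []
    target = kmer_hash(pattern)
    high = pow(BASE, k - 1, MOD)      # weight of the leftmost base
    h = kmer_hash(read[:k])
    hits = []
    for i in range(len(read) - k + 1):
        # Verify hash matches with a real comparison to rule out collisions.
        if h == target and read[i:i + k] == pattern:
            hits.append(i)
        if i + k < len(read):
            # Roll the window: drop read[i], append read[i + k].
            h = ((h - encode(read[i]) * high) * BASE + encode(read[i + k])) % MOD
    return hits


def fastq_sequences(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read name, sequence) from 4-line FASTQ records."""
    it = iter(lines)
    for header, seq, _plus, _qual in zip(it, it, it, it):
        yield header.strip()[1:], seq.strip()


def supporting_reads(lines: Iterable[str], pattern: str) -> List[Tuple[str, int]]:
    """List (read name, position) for every read containing pattern."""
    return [
        (name, pos)
        for name, seq in fastq_sequences(lines)
        for pos in find_pattern(seq, pattern)
    ]
```

For example, `supporting_reads(open("sample.fq"), "ACGT")` would list every read in a hypothetical file `sample.fq` that contains `ACGT`, together with the match position. A real scanner would also need to handle the reverse complement and tolerate sequencing errors, which this sketch does not attempt.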
