Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities

We apply hierarchical clustering (HC) of DNA k-mer counts on multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as a diagnostic device. In order to provide a simple applicable tool we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicate presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and also in reads filtered for high quality (Phred > 30). We propose, that hierarchical clustering of DNA k-mer counts provides an unspecific diagnostic tool for RNAseq experiments. Further exploration is required once samples are identified as outliers in HC derived trees.

[1]  Sebastian Deorowicz,et al.  KMC 2: Fast and resource-frugal k-mer counting , 2014, Bioinform..

[2]  Jeroen F. J. Laros,et al.  Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories , 2013, Nature Biotechnology.

[3]  David R. Kelley,et al.  Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks , 2012, Nature Protocols.

[4]  Swati C Manekar,et al.  A benchmark study of k-mer counting methods for high-throughput sequencing , 2018, GigaScience.

[5]  Eugene W. Myers,et al.  A whole-genome assembly of Drosophila. , 2000, Science.

[6]  Peter Dalgaard,et al.  R Development Core Team (2010): R: A language and environment for statistical computing , 2010 .

[7]  Mark D. Robinson,et al.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data , 2009, Bioinform..

[8]  Robert Petryszak,et al.  ArrayExpress update—simplifying data submissions , 2014, Nucleic Acids Res..

[9]  Sandrine Dudoit,et al.  GC-Content Normalization for RNA-Seq Data , 2011, BMC Bioinformatics.

[10]  P. Edwards,et al.  The role of tandem duplicator phenotype in tumour evolution in high‐grade serous ovarian cancer , 2012, The Journal of pathology.

[11]  Holger Schwender,et al.  rbamtools: an R interface to samtools enabling fast accumulative tabulation of splicing events over multiple RNA-seq samples , 2015, Bioinform..

[12]  David M. Simcha,et al.  Tackling the widespread and critical impact of batch effects in high-throughput data , 2010, Nature Reviews Genetics.

[13]  Kui Zhang,et al.  Length bias correction for RNA-seq data in gene set analyses , 2011, Bioinform..

[14]  Daniel Mapleson,et al.  KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies , 2016, bioRxiv.

[15]  Yongxiang Fang,et al.  NUDT2 Disruption Elevates Diadenosine Tetraphosphate (Ap4A) and Down-Regulates Immune Response and Cancer Promotion Genes , 2016, PloS one.

[16]  Rob Patro,et al.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms , 2013, Nature Biotechnology.

[17]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[18]  Dominique Lavenier,et al.  DSK: k-mer counting with very low memory usage , 2013, Bioinform..

[19]  Gabriel Goldstein,et al.  Improved assembly of noisy long reads by k-mer validation , 2016, bioRxiv.

[20]  Carl Kingsford,et al.  A fast, lock-free approach for efficient parallel counting of occurrences of k-mers , 2011, Bioinform..

[21]  Jonas S. Almeida,et al.  Alignment-free sequence comparison: benefits, applications, and tools , 2017, Genome Biology.

[22]  Paul Medvedev,et al.  Informed and automated k-mer size selection for genome assembly , 2013, Bioinform..

[23]  Szymon Grabowski,et al.  Disk-based k-mer counting on a PC , 2012, BMC Bioinformatics.

[24]  G. N. Lance,et al.  Computer Programs for Hierarchical Polythetic Classification ("Similarity Analyses") , 1966, Comput. J..

[25]  Gesine Reinert,et al.  Alignment-Free Sequence Analysis and Applications. , 2018, Annual review of biomedical data science.

[26]  S. Lonardi,et al.  CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers , 2015, BMC Genomics.

[27]  Páll Melsted,et al.  Efficient counting of k-mers in DNA sequences using a bloom filter , 2011, BMC Bioinformatics.

[28]  Yudong D. He,et al.  Effects of atmospheric ozone on microarray data quality. , 2003, Analytical chemistry.

[29]  D. Frick,et al.  The MutT Proteins or “Nudix” Hydrolases, a Family of Versatile, Widely Distributed, “Housecleaning” Enzymes* , 1996, The Journal of Biological Chemistry.

[30]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[31]  Paul Medvedev,et al.  Error correction of high-throughput sequencing datasets with non-uniform coverage , 2011, Bioinform..

[32]  P. Boukamp,et al.  Age, gender and UV-exposition related effects on gene expression in in vivo aged short term cultivated human dermal fibroblasts , 2017, PloS one.

[33]  P. Carmeliet,et al.  Inhibition of the Glycolytic Activator PFKFB3 in Endothelium Induces Tumor Vessel Normalization, Impairs Metastasis, and Improves Chemotherapy. , 2016, Cancer cell.

[34]  Qingpeng Zhang,et al.  These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure , 2013, PloS one.

[35]  Jeroen F. J. Laros,et al.  Determining the quality and complexity of next-generation sequencing data without a reference genome , 2014, Genome Biology.

[36]  K. Hansen,et al.  Biases in Illumina transcriptome sequencing caused by random hexamer priming , 2010, Nucleic acids research.

[37]  Robert Patro,et al.  Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms , 2013, ArXiv.

[38]  Daniel J. Gaffney,et al.  A survey of best practices for RNA-seq data analysis , 2016, Genome Biology.

[39]  Aaron L. Halpern,et al.  Consensus generation and variant detection by Celera Assembler , 2008, Bioinform..

[40]  David R. Kelley,et al.  Quake: quality-aware detection and correction of sequencing errors , 2010, Genome Biology.