Measuring reproducibility of virus metagenomics analyses using bootstrap samples from FASTQ-files

MOTIVATION High-throughput sequencing data can be affected by various technical errors, e.g. from probe preparation or false base calling, and these errors can weaken the reproducibility of experiments. In virus metagenomics, technical errors can result in falsely identified viruses in samples from infected hosts. We present a new resampling approach based on bootstrap sampling of sequencing reads from FASTQ files. It generates artificial replicates of sequencing runs, which can help to judge the robustness of an analysis. In addition, we evaluate a mixture model on the distribution of read counts per virus to identify potentially false positive findings.

RESULTS The evaluation of our approach on an artificially generated data set with known viral sequence content shows, in general, a high reproducibility of virus detection in sequencing data: the original and the mean bootstrap read counts were highly correlated. However, the bootstrap read counts can also indicate reduced or increased evidence for the presence of a virus in the biological sample. We also found that the mixture model fits the read counts well and that it achieves higher accuracy on the original or on the bootstrap read counts than on the difference between the two. The usefulness of our methods is further demonstrated on two freely available real-world data sets from harbour seals.

AVAILABILITY We provide a Python tool, called RESEQ, available from https://github.com/babaksaremi/RESEQ, that allows efficient generation of bootstrap reads from an original FASTQ file.

CONTACT klaus.jung@tiho-hannover.de.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
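The core resampling idea can be illustrated with a minimal sketch. It is not the RESEQ implementation; it only assumes that a bootstrap replicate is built by drawing as many reads as the original file contains, with replacement, from a FASTQ file that uses the common four-lines-per-read layout. The function names (`read_fastq`, `bootstrap_reads`, `write_fastq`) and the toy reads are hypothetical and chosen for illustration.

```python
import io
import random


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples.

    Assumes the simple four-lines-per-read FASTQ layout.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, sep, qual


def bootstrap_reads(records, rng):
    """Draw len(records) reads with replacement (one bootstrap replicate)."""
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]


def write_fastq(records, handle):
    """Write reads back in four-line FASTQ layout."""
    for header, seq, sep, qual in records:
        handle.write(f"{header}\n{seq}\n{sep}\n{qual}\n")


# Toy input standing in for a real FASTQ file (made-up reads).
toy_fastq = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
    "@read3\nGGCC\n+\nIIII\n"
)
original = list(read_fastq(toy_fastq))

# A fixed seed makes the artificial replicate itself reproducible.
replicate = bootstrap_reads(original, random.Random(42))

out = io.StringIO()
write_fastq(replicate, out)
```

Repeating the draw with different seeds yields several artificial replicates; each can be passed through the same virus detection pipeline, and the resulting per-virus read counts can then be compared with those from the original file. Because sampling is with replacement, read identifiers can repeat within a replicate, which some downstream tools may not accept without renaming.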
