Cont-ID: Detection of sample cross-contamination in viral metagenomic data

Background High-throughput sequencing (HTS) technologies, combined with bioinformatic analysis of the generated data, are becoming an important detection technique for virus diagnostics. They have the potential to replace or complement current PCR-based methods thanks to their improved inclusivity and analytical sensitivity, as well as their overall good repeatability and reproducibility. Cross-contamination, a well-known phenomenon in molecular diagnostics, is the exchange of genetic material between samples. Managing it was a key drawback during the development of PCR-based detection, and it is now adequately monitored in routine diagnostics. HTS technologies face similar difficulties because of their very high analytical sensitivity. Because a single viral read can be detected among millions of sequencing reads, a detection threshold must be set, and that threshold will be influenced by cross-contamination. Cross-contamination monitoring should therefore be a priority when detecting viruses by HTS technologies.
Results We present Cont-ID, a bioinformatic tool designed to check for cross-contamination by analysing the relative abundance of virus sequencing reads identified in sequenced metagenomic datasets and their duplication between samples. It can be applied when the samples in a sequencing batch have been processed in parallel in the laboratory together with at least one external alien control. Using 273 real datasets, including 68 virus species from different hosts (fruit tree, plant, human) and several library preparation protocols (ribodepleted total RNA, small RNA and double-stranded RNA), we demonstrated that Cont-ID classifies each viral species detection as (true) infection or (cross-)contamination with high accuracy (91%). This classification raises confidence in the detection and facilitates downstream interpretation and confirmation of the results by prioritising the virus detections that should be confirmed.
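
The general idea of comparing relative read abundance across a batch that includes an alien control can be illustrated with a minimal Python sketch. This is not the Cont-ID implementation: the `Detection` class, the function names, the reads-per-million normalisation, the three-way outcome and the `ratio` parameter are all assumptions made for illustration. The sketch takes the highest abundance of the alien-control virus found outside the control as a batch-specific contamination level, then flags a detection as likely contamination when it sits at or below that level and another sample in the batch carries far more of the same virus.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One virus detected in one sample (hypothetical record layout)."""
    sample: str
    virus: str
    viral_reads: int
    total_reads: int

    @property
    def rpm(self) -> float:
        # Relative abundance expressed as viral reads per million total reads.
        return 1e6 * self.viral_reads / self.total_reads


def contamination_threshold(detections, alien_virus, alien_sample):
    """Highest RPM of the alien virus observed outside the alien control.

    Any alien-virus read in another sample can only come from
    cross-contamination, so this gives a batch-specific noise level.
    """
    leaked = [d.rpm for d in detections
              if d.virus == alien_virus and d.sample != alien_sample]
    return max(leaked, default=0.0)


def classify(detections, alien_virus, alien_sample, ratio=100.0):
    """Label each non-alien detection (illustrative rules, not Cont-ID's).

    `ratio` is an assumed fold-difference above which another sample is
    treated as a plausible contamination source.
    """
    threshold = contamination_threshold(detections, alien_virus, alien_sample)

    # Highest abundance of each virus anywhere in the batch.
    max_rpm = {}
    for d in detections:
        max_rpm[d.virus] = max(max_rpm.get(d.virus, 0.0), d.rpm)

    calls = {}
    for d in detections:
        if d.virus == alien_virus:
            continue
        low = d.rpm <= threshold
        dominated = max_rpm[d.virus] >= ratio * d.rpm
        if low and dominated:
            calls[(d.sample, d.virus)] = "contamination"
        elif not low and not dominated:
            calls[(d.sample, d.virus)] = "infection"
        else:
            calls[(d.sample, d.virus)] = "unresolved"
    return calls
```

In this sketch, a sample with a handful of reads of a virus that is highly abundant in a neighbouring sample would be labelled "contamination", while the highly abundant sample would be labelled "infection". Detections that satisfy only one criterion are left "unresolved" and would need confirmation by another method. The thresholds are placeholders that a laboratory would have to tune on its own data.
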
Conclusions Cross-contamination between samples when detecting viruses using HTS can be monitored and highlighted by Cont-ID (provided an alien control is present). Cont-ID is based on a flexible methodology relying on the output of bioinformatics analyses of the sequencing reads and considering the contamination pattern specific to each batch of samples. The Cont-ID method is adaptable so that each laboratory can optimise it before its validation and routine use.
