NCBI’s Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements

A wealth of viral data sits untapped in publicly available metagenomic data sets when it might be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams comprised of over 40 participants from six countries, assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset consisting of 2953 SRA data sets (approximately 55 million contigs) was selected, which were further filtered for a minimal length of 1 kb. This resulted in 4.2 million (Mio) contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered and assigned metadata. Out of the 4.2 Mio contigs, 360,000 contigs were labeled with domains and an additional subset containing 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds thereof. Mainly: (i) Conservative assemblies of SRA data improves initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) redesigning existing bioinformatic algorithms for a cloud infrastructure to facilitate its use for a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event. Here, we present the applied workflows, initial results, and lessons learned from the hackathon.

[1]  Tim Bray,et al.  Internet Engineering Task Force (ietf) the Javascript Object Notation (json) Data Interchange Format , 2022 .

[2]  Eugene V. Koonin,et al.  Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation , 2016, Nucleic Acids Res..

[3]  Amos Bairoch,et al.  ViralZone: a knowledge resource to understand virus diversity , 2010, Nucleic Acids Res..

[4]  Erik Schultes,et al.  The FAIR Guiding Principles for scientific data management and stewardship , 2016, Scientific Data.

[5]  Steven L Salzberg,et al.  HISAT: a fast spliced aligner with low memory requirements , 2015, Nature Methods.

[6]  Alexandre Souvorov,et al.  SKESA: strategic k-mer extension for scrupulous assemblies , 2018, Genome Biology.

[7]  Chao Xie,et al.  Fast and sensitive protein alignment using DIAMOND , 2014, Nature Methods.

[8]  Colin Hill,et al.  VIGA: a sensitive, precise and automatic de novo VIral Genome Annotator , 2018, bioRxiv.

[9]  Edward C. Holmes,et al.  Redefining the invertebrate RNA virosphere , 2016, Nature.

[10]  Hideaki Sugawara,et al.  The Sequence Read Archive , 2010, Nucleic Acids Res..

[11]  Matthew B. Sullivan,et al.  VirSorter: mining viral signal from microbial genomic data , 2015, PeerJ.

[12]  P. Daszak,et al.  The Global Virome Project , 2018, Science.

[13]  Evan Bolton,et al.  Database resources of the National Center for Biotechnology Information , 2017, Nucleic Acids Res..

[14]  Sean R. Davis,et al.  SRAdb: query and use public next-generation sequencing data from within R , 2013, BMC Bioinformatics.

[15]  Ning Ma,et al.  BLAST+: architecture and applications , 2009, BMC Bioinformatics.

[16]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[17]  Bo Yu,et al.  CDD/SPARCLE: functional classification of proteins via subfamily domain architectures , 2016, Nucleic Acids Res..

[18]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[19]  Elaine R. Mardis,et al.  A decade’s perspective on DNA sequencing technology , 2011, Nature.

[20]  Mike Mikailov,et al.  A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection , 2018, mSphere.

[21]  Robert A. Edwards,et al.  PARTIE: a partition engine to separate metagenomic and amplicon projects in the Sequence Read Archive , 2017, Bioinform..

[22]  Anton J. Enright,et al.  An efficient algorithm for large-scale detection of protein families. , 2002, Nucleic acids research.

[23]  Miriam L. Land,et al.  Trace: Tennessee Research and Creative Exchange Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification Recommended Citation Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification , 2022 .

[24]  Tanja Woyke,et al.  Viral dark matter and virus–host interactions resolved from publicly available microbial genomes , 2015, eLife.

[25]  Yiming Bao,et al.  NCBI Viral Genomes Resource , 2014, Nucleic Acids Res..

[26]  Sean R. Eddy,et al.  Infernal 1.1: 100-fold faster RNA homology searches , 2013, Bioinform..

[27]  Ken Youens-Clark,et al.  Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons , 2018, GigaScience.

[28]  Sergey Koren,et al.  Mash Screen: high-throughput sequence containment estimation for genome discovery , 2019, Genome Biology.

[29]  Rasko Leinonen,et al.  The sequence read archive: explosive growth of sequencing data , 2011, Nucleic Acids Res..

[30]  R. Edwards,et al.  A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes , 2014, Nature Communications.