Swarm v3: towards tera-scale amplicon clustering

Abstract

Motivation: Previously we presented swarm, an open-source amplicon clustering programme that produces fine-scale molecular operational taxonomic units (OTUs) that are free of arbitrary global clustering thresholds. Here, we present swarm v3 to address issues of contemporary datasets that are growing towards terabyte sizes.

Results: When compared with previous swarm versions, swarm v3 has modernized C++ source code, a memory footprint reduced by up to 50%, and optimized CPU usage and multithreading (more than 7 times faster with default parameters). It has also been extensively tested for robustness and logic.

Availability and implementation: Source code and binaries are available at https://github.com/torognes/swarm.

Supplementary information: Supplementary data are available at Bioinformatics online.
