Swarm v2: highly-scalable and high-resolution amplicon clustering

Previously we presented Swarm v1, a novel, open-source amplicon clustering program that produced fine-scale molecular operational taxonomic units (OTUs), free of arbitrary global clustering thresholds and input-order dependency. Swarm v1 worked in two phases: an initial phase of iterative single-linkage clustering with a local clustering threshold (d), followed by a phase that used the internal abundance structure of each cluster to break chained OTUs. Here we present Swarm v2, which has two important novel features: (1) a new algorithm for d = 1 that allows the computation time of the program to scale linearly with increasing amounts of data; and (2) the new fastidious option, which reduces under-grouping by grafting low-abundance OTUs (e.g., singletons and doubletons) onto larger ones. Swarm v2 also directly integrates the clustering and breaking phases, dereplicates sequencing reads with d = 0, outputs OTU representatives in FASTA format, and plots individual OTUs as two-dimensional networks.
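To illustrate how clustering at d = 1 can avoid all-against-all comparisons, here is a minimal Python sketch. It is not the Swarm v2 source code. It assumes that input reads are already dereplicated into a mapping from sequence to abundance, that every one-difference variant ("microvariant") of a sequence is generated and looked up in a hash table, and that a neighbour joins an OTU only if it is no more abundant than the amplicon that recruited it, which folds the breaking step into the clustering step. The names `microvariants` and `cluster_d1` are hypothetical.

```python
from collections import deque

ALPHABET = "ACGT"


def microvariants(seq):
    """Return all sequences exactly one substitution, insertion, or deletion away."""
    variants = set()
    for i in range(len(seq)):
        variants.add(seq[:i] + seq[i + 1:])                # deletion
        for base in ALPHABET:
            if base != seq[i]:
                variants.add(seq[:i] + base + seq[i + 1:])  # substitution
    for i in range(len(seq) + 1):
        for base in ALPHABET:
            variants.add(seq[:i] + base + seq[i:])          # insertion
    variants.discard(seq)
    return variants


def cluster_d1(abundances):
    """Cluster dereplicated amplicons (sequence -> abundance) with d = 1.

    Each amplicon is expanded once, and each expansion costs a number of
    hash lookups proportional to its length, so the total work grows
    linearly with the number of distinct amplicons.
    """
    # Seeds are taken by decreasing abundance (ties broken alphabetically).
    order = sorted(abundances, key=lambda s: (-abundances[s], s))
    assigned = set()
    otus = []
    for seed in order:
        if seed in assigned:
            continue
        assigned.add(seed)
        otu = [seed]
        queue = deque([seed])
        while queue:
            parent = queue.popleft()
            hits = sorted(
                v for v in microvariants(parent)
                if v in abundances
                and v not in assigned
                # Assumed breaking rule: never climb back up in abundance.
                and abundances[v] <= abundances[parent]
            )
            for v in hits:
                assigned.add(v)
                otu.append(v)
                queue.append(v)
        otus.append(otu)
    return otus


if __name__ == "__main__":
    reads = {"ACGT": 10, "ACGA": 3, "ACG": 2, "TTTT": 5, "TTTA": 1, "GGGGGG": 1}
    for otu in cluster_d1(reads):
        print(otu)
```

In this sketch the first sequence of each OTU is its most abundant member (the seed). The abundance check is what stops a rare intermediate amplicon from chaining two abundant amplicons into a single OTU.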
