Bartender: a fast and accurate clustering algorithm to count barcode reads

Motivation: Barcode sequencing (bar‐seq) is a high‐throughput and cost‐effective method to assay large numbers of cell lineages or genotypes in complex cell pools. Because of these advantages, applications for bar‐seq are growing quickly, from using neutral random barcodes to study the evolution of microbes or cancer to using pseudo‐barcodes, such as shRNAs or sgRNAs, to screen large numbers of cell perturbations simultaneously. However, the computational pipelines for bar‐seq clustering are not well developed. Available methods often yield a high frequency of under‐clustering artifacts that result in spurious barcodes, or over‐clustering artifacts that group distinct barcodes together. Here, we developed Bartender, an accurate clustering algorithm to detect barcodes and their abundances from raw next‐generation sequencing data.

Results: In contrast with existing methods that cluster based on sequence similarity alone, Bartender uses a modified two‐sample proportion test that also considers cluster size. This modification results in higher accuracy and lower rates of under‐ and over‐clustering artifacts. Additionally, Bartender includes unique molecular identifier handling and a ‘multiple time point’ mode that matches barcode clusters between different clustering runs for seamless handling of time course data. Bartender is a set of simple‐to‐use command line tools that can be run on a laptop with run times comparable to those of existing methods.

Availability and implementation: Bartender is available at no charge for non‐commercial use at https://github.com/LaoZZZZZ/bartender-1.1.

Contact: sasha.levy@stonybrook.edu or song.wu@stonybrook.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
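To illustrate why a proportion test makes the clustering decision depend on cluster size, the following is a minimal sketch of the standard pooled two‐sample proportion z‐test. It is not Bartender's modified statistic, which the abstract does not specify. The function names, the interpretation of the counts (reads in each cluster carrying a given base at one barcode position), and the 1.96 cutoff are all assumptions made for illustration.

```python
import math


def two_sample_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample proportion z statistic (textbook form).

    x1 of n1 reads in cluster 1, and x2 of n2 reads in cluster 2,
    carry a given base at one position (hypothetical interpretation).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("cluster sizes must be positive")
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Both clusters are unanimous and identical at this position.
        return 0.0
    return (p1 - p2) / se


def should_merge(x1, n1, x2, n2, z_cutoff=1.96):
    """Merge when the proportions are not significantly different.

    z_cutoff is an illustrative threshold, not a Bartender parameter.
    """
    return abs(two_sample_proportion_z(x1, n1, x2, n2)) < z_cutoff


# Same observed proportions (0.75 vs about 0.33), different cluster sizes.
small = should_merge(3, 4, 1, 3)        # few reads: difference not significant
large = should_merge(300, 400, 100, 300)  # many reads: difference significant
print(small, large)
```

With identical observed proportions, the two small clusters are merged while the two large clusters are kept apart, because the standard error shrinks as the number of reads grows. A purely distance‐based rule would treat both cases the same way.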
