Twelve years of SAMtools and BCFtools

Abstract

Background: SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods.

Findings: The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines.

Conclusion: Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed >1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.
