CompMap: a reference-based compression program to speed up read mapping to related reference sequences

SUMMARY Exhaustive mapping of next-generation sequencing data to a set of relevant reference sequences has become an important task in pathogen discovery and metagenomic classification. However, runtime and memory usage grow with the number of reference sequences and with the repeat content among them. In many applications, read mapping dominates the total runtime. We developed CompMap, a reference-based compression program, to speed up this process. CompMap generates a non-redundant representative sequence for the input sequences. We demonstrate that reads can be mapped to this representative sequence with much reduced time and memory usage, and that the mapping to the original reference sequences can be recovered with high accuracy.

AVAILABILITY AND IMPLEMENTATION CompMap is implemented in C and freely available at http://csse.szu.edu.cn/staff/zhuzx/CompMap/.

CONTACT xiaoyang@broadinstitute.org

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
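
The idea summarized above (collapse related references into one non-redundant representative, map to it, then recover positions on the originals) can be illustrated with a toy sketch. This is not CompMap's algorithm or its C implementation: it assumes exact, fixed-size, non-overlapping blocks, and the names `build_representative`, `lift`, and `block` are hypothetical, introduced only for illustration.

```python
import bisect
from collections import defaultdict


def build_representative(refs, block=4):
    """Toy deduplication of related references (illustrative assumption only).

    refs: dict mapping reference name -> sequence string.
    Returns (representative, origins), where origins maps the start of each
    unique block in the representative to every (name, start) it came from.
    """
    rep_parts = []               # unique blocks in order of first appearance
    block_start = {}             # block content -> start offset in representative
    origins = defaultdict(list)  # representative start -> [(name, ref_start)]
    rep_len = 0
    for name, seq in refs.items():
        for i in range(0, len(seq), block):
            chunk = seq[i:i + block]
            if chunk not in block_start:
                # First time this block is seen: append it to the representative.
                block_start[chunk] = rep_len
                rep_parts.append(chunk)
                rep_len += len(chunk)
            # Record where this block occurs in the original reference.
            origins[block_start[chunk]].append((name, i))
    return "".join(rep_parts), dict(origins)


def lift(rep_pos, origins):
    """Translate a position on the representative back to all original positions."""
    starts = sorted(origins)
    # Find the unique block that contains rep_pos.
    k = bisect.bisect_right(starts, rep_pos) - 1
    start = starts[k]
    offset = rep_pos - start
    return [(name, ref_start + offset) for name, ref_start in origins[start]]


refs = {"ref1": "ACGTTTGACCAA", "ref2": "ACGTGGGACCAA"}
rep, origins = build_representative(refs, block=4)
print(len(rep), "bases kept out of", sum(len(s) for s in refs.values()))
print(lift(9, origins))   # a hit in a block shared by both references
print(lift(13, origins))  # a hit in a block unique to ref2
```

In this sketch a hit inside a shared block lifts back to every reference that contains the block, while a hit inside a unique block lifts back to one reference only. Reads that span block boundaries, inexact repeats, and alignment scoring are deliberately ignored here.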