MRCSI: Compressing and Searching String Collections with Multiple References

Efficiently storing and searching collections of similar strings, such as large populations of genomes or long change histories of documents from Wikis, is a timely and challenging problem. Several recent proposals drastically reduce space requirements by exploiting the similarity between strings, an approach known as reference-based compression. However, the compressed collections are usually no longer searchable, i.e., these methods sacrifice search efficiency for storage efficiency. We propose Multi-Reference Compressed Search Indexes (MRCSI) as a framework for efficiently compressing dissimilar string collections. In contrast to previous work, which can use only a single reference for compression, MRCSI (a) uses multiple references to achieve higher compression rates, where the reference set need not be specified by the user but is determined automatically, and (b) supports efficient approximate string search with edit distance constraints. We prove that finding the smallest MRCSI is NP-hard. We then propose three heuristics for computing MRCSIs that achieve increasing compression ratios. Compared to state-of-the-art competitors, our methods target an interesting and novel sweet spot between high compression ratio and search efficiency.
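To make the idea of reference-based compression with more than one reference concrete, the following is a minimal sketch, not the MRCSI construction itself. It assumes a toy greedy scheme in which each string is encoded as a sequence of matches `(ref_index, start, length)` into a fixed list of references, with unmatched characters stored as literals. The function names, the `min_len` threshold, and the entry layout are all hypothetical and chosen only for illustration; the reference set is given by hand here, whereas MRCSI determines it automatically.

```python
from typing import List, Tuple, Union

# Hypothetical entry layout: a match (ref_index, start, length) or a literal character.
Entry = Union[Tuple[int, int, int], str]


def longest_match(refs: List[str], text: str, pos: int) -> Tuple[int, int, int]:
    """Return (ref_index, start, length) of the longest prefix of text[pos:]
    that occurs in any reference; length is 0 if nothing matches."""
    best = (-1, -1, 0)
    for r, ref in enumerate(refs):
        start, length = -1, 0
        # Naive extension: grow the candidate by one character per step.
        while pos + length < len(text):
            hit = ref.find(text[pos:pos + length + 1])
            if hit < 0:
                break
            start, length = hit, length + 1
        if length > best[2]:
            best = (r, start, length)
    return best


def compress(refs: List[str], text: str, min_len: int = 2) -> List[Entry]:
    """Greedily encode text against several references (toy scheme)."""
    out: List[Entry] = []
    pos = 0
    while pos < len(text):
        r, start, length = longest_match(refs, text, pos)
        if length >= min_len:
            out.append((r, start, length))  # copy from reference r
            pos += length
        else:
            out.append(text[pos])  # too short to be worth a pointer
            pos += 1
    return out


def decompress(refs: List[str], entries: List[Entry]) -> str:
    """Rebuild the original string from matches and literals."""
    parts = []
    for e in entries:
        if isinstance(e, str):
            parts.append(e)
        else:
            r, start, length = e
            parts.append(refs[r][start:start + length])
    return "".join(parts)
```

With a single reference, a string that mixes material from two sources would need many literals; allowing matches into either reference lets the same string be covered by a few long entries. The naive matching loop is quadratic and only meant to show the encoding; a practical implementation would use an index over the references.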
