Hybrid indexes for repetitive datasets

Advances in DNA sequencing mean that databases of thousands of human genomes will soon be commonplace. In this paper, we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we pre-process the text with the lossless data compression algorithm LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show that this technique also significantly reduces query times.
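The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and it rests on several assumptions: it handles exact matching only (edit distance 0), it uses a naive greedy LZ77 parse, Python's `str.find` stands in for the conventional index, and a linear scan over phrases stands in for whatever structure a real implementation would use to map matches through the parse. The names `lz77_parse`, `build_hybrid_index`, `find_all`, and `SEP` are hypothetical.

```python
SEP = "\x00"  # assumption: this separator never occurs in the text or in patterns


def lz77_parse(text):
    """Naive greedy LZ77 parse. Returns (start, length, source) triples;
    source is None for a literal (first occurrence of a character)."""
    phrases, i, n = [], 0, len(text)
    while i < n:
        best_len, best_src, length = 0, None, 1
        while i + length <= n:
            # Look for a copy that starts before i (overlap with the phrase is allowed).
            src = text.find(text[i:i + length], 0, i + length - 1)
            if src == -1:
                break
            best_len, best_src = length, src
            length += 1
        if best_len == 0:
            phrases.append((i, 1, None))
            i += 1
        else:
            phrases.append((i, best_len, best_src))
            i += best_len
    return phrases


def build_hybrid_index(text, max_len):
    """Keep only characters within max_len of a phrase boundary; each
    dropped run is replaced by a single separator."""
    phrases = lz77_parse(text)
    keep = [False] * len(text)
    for start, length, _ in phrases:
        end = start + length
        for p in range(start, min(start + max_len, end)):
            keep[p] = True
        for p in range(max(end - max_len, start), end):
            keep[p] = True
    filtered, origin = [], []  # origin[f] = position in text of filtered[f]
    for p, ch in enumerate(text):
        if keep[p]:
            filtered.append(ch)
            origin.append(p)
        elif filtered[-1] != SEP:
            filtered.append(SEP)
            origin.append(-1)
    return "".join(filtered), origin, phrases


def find_all(filtered, origin, phrases, pattern, max_len):
    """Report all exact occurrences of pattern in the original text."""
    m = len(pattern)
    if not 0 < m <= max_len:
        raise ValueError("pattern length must be between 1 and max_len")
    # Step 1: matches in the filtered text (str.find stands in for the index).
    found = set()
    f = filtered.find(pattern)
    while f != -1:
        found.add(origin[f])
        f = filtered.find(pattern, f + 1)
    # Step 2: a match lying inside the source of a copy phrase is copied
    # along with it, so map it to the phrase and repeat until nothing new appears.
    work = list(found)
    while work:
        p = work.pop()
        for start, length, src in phrases:
            if src is not None and src <= p and p + m <= src + length:
                q = p - src + start
                if q not in found:
                    found.add(q)
                    work.append(q)
    return sorted(found)
```

For example, with `text = "GATTACA" * 40` and `max_len = 4`, `build_hybrid_index(text, 4)` returns a filtered text far shorter than the 280-character input, and `find_all(filtered, origin, phrases, "TACA", 4)` still reports every occurrence in the original text. Keeping `max_len` characters on each side of every phrase boundary is slightly more than exact matching strictly needs; the extra margin only costs space and keeps the sketch simple.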
