On the repetitive collection indexing problem

Large data sets such as genomes from a single species, large sets of reads, and version control data often consist of entries that differ from one another by only a small number of variations. The result is a large collection with a great deal of redundancy and repetitiveness. Rapid development in DNA sequencing technologies has caused drastic growth in the size of publicly available sequence databases containing such data. DNA sequencing has become so fast and cost-effective that sequencing individual genomes will soon become a common task [9], which makes storing and querying such collections an important problem. In this paper, we propose an indexing structure for highly repetitive collections of sequence data based on a multilevel q-gram model. In particular, the proposed algorithm accommodates variations that may occur in the target sequence with respect to the reference sequence.

The paper is organized as follows. Sections [1] and [2] introduce the basic concepts and review the related literature. Section [3] presents notions and facts. Sections [4] and [5] give the details of the proposed data structure and algorithm, Section [6] discusses the complexity analysis, and Section [7] gives conclusions and future work.