Paper information - On the repetitive collection indexing problem


In large data sets such as genomes from a single species, large sets of reads, and version control data, each entry often differs from another by only a small number of variations. The result is a large data set with a great deal of redundancy and repetitiveness. Rapid development in DNA sequencing technologies has caused drastic growth in the size of publicly available sequence databases containing such data. DNA sequencing has become so fast and cost-effective that sequencing individual genomes will soon become a common task [9], making the storage and querying of such data sets an important problem. In this paper, we propose an indexing structure for highly repetitive collections of sequence data based on a multilevel q-gram model. In particular, the proposed algorithm accommodates variations that may occur in the target sequence with respect to the reference sequence. The paper is organized as follows. Sections 1 and 2 introduce the basic concepts and review the related literature. Section 3 presents notions and facts. Details of the proposed data structure and algorithm are given in Sections 4 and 5. Section 6 presents the complexity analysis, and Section 7 gives conclusions and future work.
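For readers unfamiliar with q-gram indexing, the following is a minimal single-level sketch of the general idea, not the multilevel structure proposed in the paper. It maps every length-q substring of a reference sequence to its start positions and uses that map to filter candidate positions for an exact pattern match. The function names `build_qgram_index` and `find_exact`, the example sequence, and the choice q = 3 are all illustrative assumptions.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map each length-q substring of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def find_exact(text, index, q, pattern):
    """Return start positions of exact occurrences of pattern in text.

    The first q-gram of the pattern selects candidate positions from the
    index; each candidate is then verified against the full pattern.
    """
    if len(pattern) < q:
        raise ValueError("pattern must be at least q characters long")
    candidates = index.get(pattern[:q], [])
    return [i for i in candidates if text[i:i + len(pattern)] == pattern]


# Hypothetical toy reference sequence, for illustration only.
reference = "ACGTACGTTACG"
index = build_qgram_index(reference, 3)
print(find_exact(reference, index, 3, "ACGT"))  # prints [0, 4]
```

Handling variations between a target and a reference sequence, which the abstract names as the paper's focus, would need additional machinery that this sketch does not attempt.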

Costas S. Iliopoulos | Carl Barton | Ali Alatabbi

[1] Ellen R. Bergeman, et al. Graph database systems, 1995.

[2] Gonzalo Navarro, et al. A guided tour to approximate string matching, 2001, CSUR.

[3] Anthony K. H. Tung, et al. Indexing DNA Sequences Using q-Grams, 2005, DASFAA.

[4] Costas S. Iliopoulos, et al. Querying Highly Similar Structured Sequences via Binary Encoding and Word Level Operations, 2012, AIAI.

[5] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[6] Costas S. Iliopoulos, et al. Querying highly similar sequences, 2013, Int. J. Comput. Biol. Drug Des.

[7] Gonzalo Navarro, et al. Storage and Retrieval of Individual Genomes, 2009, RECOMB.

[8] Derick Wood, et al. Approximate string matching with suffix automata, 2005, Algorithmica.

[9] Siu-Ming Yiu, et al. Indexing Similar DNA Sequences, 2010, AAIM.

[10] Esko Ukkonen, et al. Approximate String Matching with q-grams and Maximal Matches, 1992, Theor. Comput. Sci.

[11] Costas S. Iliopoulos, et al. An algorithm for mapping short reads to a dynamically changing genomic sequence, 2010, 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[12] Sang Joon Kim, et al. A Mathematical Theory of Communication, 2006.

[13] Costas S. Iliopoulos, et al. An algorithm for mapping short reads to a dynamically changing genomic sequence, 2012, J. Discrete Algorithms.