Development of indexing compressed structure for analyzing a collection of similar genomes: application to rice

As the cost of DNA sequencing decreases, high-throughput sequencing technologies are becoming accessible to more and more laboratories. Consequently, new issues emerge that require new algorithms, including tools for indexing and compressing thousands of genomes, as in the 3000 rice genomes project [1], in which we are particularly interested. Genomes can be considered as very large texts over a simple alphabet Σ = {A, C, G, T}. We can refer to the indexable dictionary problem, which consists in storing a set S ⊆ {0, ..., i, ..., m − 1} from a universe U with |U| = n as a bit vector B(n), where B[i] = 1 ⇔ i ∈ S. An indexable dictionary supports two additional operations, rank_s(i) and select_s(i), for s ∈ {0, 1}. The function rank_s(i) returns the number of elements equal to s up to position i, and select_s(i) returns the position of the i-th occurrence of s. Indexing complete genomes is an important stage in the exploration and understanding of data from living organisms. An efficient index should provide a quick answer to the following questions: - How many times does a given pattern appear in the genome? - What are the positions of a given pattern? - What is the pattern length at the i-th position in the genome? The common way to index and compress one genome is to use the Burrows-Wheeler Transform (BWT) [2] with the FM-index [3] built on the BWT sequence to answer queries. To index several genomes against one reference genome, one may use MuGI [4]. To build its index, MuGI stores the reference in compact form (4 bits to encode a single character), a variant database, one bit vector for each variant, and an array kMA keeping information about each k-mer. This is a really interesting approach, but it requires a reference genome. We present a structure that proposes a solution to index and compress very repetitive texts over a small alphabet using k-mers. k-mers are factors of length k in the considered sequences.
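
The rank and select operations of the indexable dictionary described above can be illustrated with a minimal Python sketch. It is not the succinct structure used in this work: the class name `BitVector`, the helper `from_set`, the inclusive reading of "up to i", and the 1-based occurrence count in `select` are all assumptions made for illustration, and the plain prefix-sum table is far from space-optimal.

```python
class BitVector:
    """Naive indexable dictionary over a bit vector (illustrative only)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] = number of 1s in bits[0 .. i-1]
        self.prefix = [0]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)

    def rank(self, s, i):
        """Number of entries equal to s in bits[0 .. i], i included (assumed convention)."""
        ones = self.prefix[i + 1]
        return ones if s == 1 else (i + 1) - ones

    def select(self, s, j):
        """Position of the j-th occurrence of s (j starts at 1), or -1 if there is none."""
        count = 0
        for pos, b in enumerate(self.bits):
            if b == s:
                count += 1
                if count == j:
                    return pos
        return -1


def from_set(S, m):
    """Build B over {0, ..., m-1} with B[i] = 1 iff i is in S."""
    return BitVector(1 if i in S else 0 for i in range(m))
```

For S = {1, 3, 4} and m = 6 the bit vector is 010110, so under these conventions `rank(1, 3)` counts two 1s in positions 0 to 3 and `select(1, 3)` returns position 4.
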
We build an array of size 4^k1, where k1 < k; each entry is itself an array, indexed by a prefix of size k1 of the existing k-mers. In each prefix array we insert a bit vector of size 4^k2 which represents all possible k-mers beginning with the considered prefix. We will use libGkArray [5] to query large read collections and update our structure. We chose libGkArray rather than JellyFish [6] and KMC (any version) [7] for working in main memory. To build the index, we cut our genomes into k-mers, and we split each k-mer into a prefix and a suffix of respective sizes k1 and k2. We call the function kmer_to_int(), which takes a k-mer and returns its integer value. We then go to the prefix array PA[kmer_to_int(p)], where p is the prefix of size k1, and we add the suffix of size k2 to our suffix array. We also add a 1 in the succinct structure for genome G_i, i ∈ n, with n the number of genomes, as shown in Fig. 1. Given n the number of genomes and N the number of k-mers in the genome set, we can estimate the time and space complexity as O(N log(n)) and O(N × 2k2 log(n + N)), respectively. Our structure has to be efficient in memory space and computing time.
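
The construction steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of this work: the 2-bit encoding in `kmer_to_int`, the class `PrefixSuffixIndex`, and its methods are hypothetical; a dictionary stands in for the 4^k2 bit vector of each prefix entry, and a plain list of presence bits stands in for the succinct per-genome structure. libGkArray is not used here.

```python
# Assumed 2-bit encoding of nucleotides (the paper does not specify the order).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Integer value of a k-mer read as a base-4 number."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


class PrefixSuffixIndex:
    """Toy index: prefix array PA of 4**k1 entries, each mapping a suffix to presence bits."""

    def __init__(self, k1, k2, n_genomes):
        self.k1, self.k2, self.n = k1, k2, n_genomes
        # One entry per possible prefix; a dict replaces the 4**k2 bit vector.
        self.PA = [dict() for _ in range(4 ** k1)]

    def add_genome(self, genome_id, sequence):
        k = self.k1 + self.k2
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if not all(base in CODE for base in kmer):
                continue  # skip k-mers containing N or other symbols
            p = kmer_to_int(kmer[:self.k1])   # prefix of size k1
            s = kmer_to_int(kmer[self.k1:])   # suffix of size k2
            presence = self.PA[p].setdefault(s, [0] * self.n)
            presence[genome_id] = 1           # mark genome G_i as containing the k-mer

    def genomes_with(self, kmer):
        """Identifiers of the genomes that contain the given k-mer."""
        p = kmer_to_int(kmer[:self.k1])
        s = kmer_to_int(kmer[self.k1:])
        presence = self.PA[p].get(s)
        if presence is None:
            return []
        return [g for g, bit in enumerate(presence) if bit]
```

With k1 = k2 = 2, indexing "ACGTA" as genome 0 and "CGTAC" as genome 1 stores the 4-mers ACGT, CGTA and GTAC; a query for CGTA then reports both genomes, while ACGT is reported for genome 0 only.
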