Indexing and querying of sequence databases

Sequence databases arise in many real-world applications. Time series data, such as stock market data, and biological data, such as DNA sequences, protein sequences, and protein structures, are only a few examples. Next-generation sequence databases (a) encompass terabytes of data, (b) use complex measures of similarity and distance, and (c) contain data of arbitrary size. Together, properties (a) to (c) make indexing and querying these databases difficult. This dissertation is concerned with scalable indexing methods that enable efficient querying of sequence databases. Three types of sequence databases are considered. The first is time series data. An index structure is proposed that allows optimal querying when the query length is not known in advance. Shift and scale invariance when multiple attributes are involved is also discussed. The second is genome sequences, such as DNA and protein sequences. A novel vector space embedding is developed and then used for (i) searching for a query sequence, (ii) comparing large sequences, and (iii) incremental searches. The third is protein structures. An efficient index is proposed that enables searches at the Secondary Structure Element (SSE) level. Finally, a single index structure that combines both the sequence and the structure of proteins is developed.