Indexing non-traditional and multimedia databases

We examine methods for fast searching of non-traditional and multimedia datatypes. Target queries are 'queries by example', such as 'find all the stocks whose performance is similar to that of IBM's stock'. The general approach is to use k feature-extraction functions to map the data items into points in a k-dimensional space, and then to use spatial access methods for clustering and indexing. In this dissertation, we look into ways of enhancing the general approach, as well as methods for specific datatypes. We first examine the problem of sequence matching when translation, scaling and gaps are allowed. We present a fast algorithm that first uses an index tree to quickly locate promising pairs of matching subsequences. The algorithm then eliminates the pairs that cannot be 'stitched' together and returns the rest. Experiments on real stock-price data showed that the algorithm can identify unexpected matches. The second problem we investigate is the 'dimensionality curse', which arises when the number of features k is large. For this case we propose the TV-tree, which postpones the effects of the dimensionality curse by using only the necessary (and probably small) number of features to distinguish among the data items in the collection. We implement the TV-tree and run experiments showing that it outperforms the R*-tree, which is among the best of the fixed-dimensionality structures. The third problem we look at is the 'featureless' case, where we are given only a dissimilarity function but no features. For this case we propose a linear-time algorithm ('FastMap') that maps data items into points while preserving most of the distance/dissimilarity information. We show that the method is much faster than the traditional technique, 'multidimensional scaling', and that it provides excellent visualization for several applications, including documents and video clips.
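To make the 'featureless' case concrete, the following is a minimal Python sketch of a pivot-based mapping in the spirit of FastMap. The abstract itself gives no algorithmic detail, so everything here is an assumption for illustration: the function name `fastmap`, the parameter `pivot_iters`, the farthest-pair heuristic for choosing pivots, and the cosine-law projection with residual distances.

```python
import math


def fastmap(objects, dist, k, pivot_iters=5):
    """Map each object to a k-dimensional point using only dist(a, b).

    Illustrative sketch: for a fixed k, the number of distance
    evaluations grows linearly with len(objects).
    """
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, dim):
        # Squared distance that remains after subtracting the
        # contribution of the first `dim` coordinates already assigned.
        d2 = dist(objects[i], objects[j]) ** 2
        for h in range(dim):
            d2 -= (coords[i][h] - coords[j][h]) ** 2
        return max(d2, 0.0)  # clamp rounding noise and non-Euclidean cases

    for dim in range(k):
        def farthest(i):
            return max(range(n), key=lambda j: residual_sq(i, j, dim))

        # Heuristic pivot choice (assumption): start anywhere and
        # repeatedly jump to the farthest remaining object.
        a = 0
        b = farthest(a)
        for _ in range(pivot_iters - 1):
            a, b = b, farthest(b)

        d_ab_sq = residual_sq(a, b, dim)
        if d_ab_sq == 0.0:
            break  # no distance left to explain; later coordinates stay 0
        d_ab = math.sqrt(d_ab_sq)

        # Project every object onto the line through the two pivots
        # using the cosine law.
        for i in range(n):
            coords[i][dim] = (
                residual_sq(a, i, dim) + d_ab_sq - residual_sq(b, i, dim)
            ) / (2.0 * d_ab)

    return coords


if __name__ == "__main__":
    # Hypothetical input: four corners of a 3-by-4 rectangle.
    points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
    for original, mapped in zip(points, fastmap(points, math.dist, k=2)):
        print(original, "->", mapped)
```

In this small example the input is already Euclidean and two-dimensional, so the mapped points reproduce the original pairwise distances up to rounding. With a general dissimilarity function the mapping is only approximate, which matches the abstract's claim of preserving "most of" the distance information. The resulting points could then be stored in a spatial access method or plotted directly when k is 2 or 3.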