Document Retrieval on String Collections

Indexing data so that it can be searched efficiently is one of the most fundamental problems in computer science. In databases and information retrieval in particular, indexing is at the heart of query processing. One of the most popular indexes, used by all search engines, is the inverted index. However, in many settings, such as bioinformatics, Eastern-language texts, and phrase queries on the Web, one cannot assume word demarcations. In such cases, each document must be treated as a string of characters, and more sophisticated solutions are required.

Formally, we are given a collection of $D$ documents $\mathcal{D} = \{d_1, d_2, d_3, \ldots, d_D\}$. Each document $d_i$ is a string drawn from a character set $\Sigma$ of size $\sigma$, and the total number of characters across all documents is $n$. The task is to preprocess this collection and build a data structure so that queries can be answered as quickly as possible. A query consists of a pattern string $P$ of length $p$, drawn from $\Sigma$. The answer to the query is the set of all documents $d_i$ in which $P$ occurs as a substring. This is called the document listing problem. In the more advanced top-$k$ version, the query is a tuple $(P, k)$, where $k$ is an integer, and only the $k$ most relevant documents are to be output. This is called the top-$k$ document retrieval problem. The notion of relevance is captured by a score function: $\mathrm{score}(P, d)$ denotes the score of document $d$ with respect to pattern $P$. It can be the number of times $P$ occurs in $d$ (known as term frequency), the distance between the two closest occurrences of $P$ in $d$, or any other function. Here, we assume that $\mathrm{score}(P, d)$ depends solely on the set of occurrences of $P$ in $d$ and is known at the time the data structure is constructed.
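To make the two problem statements concrete, the following is a minimal brute-force sketch in Python. It is not an index and not the data structure this entry is concerned with: it scans every document at query time, and it serves only as a reference for what a correct answer looks like. The function names (term_frequency, document_listing, top_k), the sample documents, and the tie-breaking rule (smaller document index first) are assumptions made for illustration; term frequency is used as the score, counting overlapping occurrences.

```python
import heapq


def term_frequency(pattern, doc):
    """Number of (possibly overlapping) occurrences of pattern in doc."""
    if not pattern:
        return 0
    count = 0
    start = doc.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping occurrences are counted.
        start = doc.find(pattern, start + 1)
    return count


def document_listing(docs, pattern):
    """Indices of all documents that contain pattern as a substring."""
    return [i for i, doc in enumerate(docs) if pattern in doc]


def top_k(docs, pattern, k, score=term_frequency):
    """Indices of the k highest-scoring documents that contain pattern.

    Ties are broken in favour of the smaller document index
    (an illustrative choice, not part of the problem definition).
    """
    candidates = [(score(pattern, docs[i]), i)
                  for i in document_listing(docs, pattern)]
    best = heapq.nlargest(k, candidates, key=lambda t: (t[0], -t[1]))
    return [i for _, i in best]


if __name__ == "__main__":
    docs = ["banana", "bandana", "cabana", "panama"]
    print(document_listing(docs, "ban"))  # documents containing "ban"
    print(top_k(docs, "ana", 2))          # two best documents for "ana"
```

Each query here costs time proportional to the total collection size. The purpose of preprocessing is to avoid that scan, so that query time depends mainly on the pattern length and the size of the output.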