Text databases and information retrieval

The goal of a traditional information retrieval (IR) system is to search an information repository, such as a text database, and retrieve documents that are potentially relevant to a query. Since query-based IR systems must operate in real time, they must be able to search large volumes of text quickly and efficiently.

Other information-retrieval applications, such as text categorization, text routing, and text filtering, are also becoming increasingly important. These applications are generally concerned with long-term information needs, where a topic is expected to be of interest for an extended period of time. Text categorization systems assign predefined category labels to texts. For example, a text categorization system for computer science might use categories such as operating systems, programming languages, artificial intelligence, or information retrieval. Text routing systems typically accept a set of user profiles and automatically classify texts so that relevant texts can be routed to appropriate users [Harman 1994]. Text filtering systems accept a list of topics that are, or are not, of interest and allow only texts that satisfy the filter to pass through to the user [Belkin and Croft 1992]. Text categorization systems are typically applied to static databases, while text routing and text filtering systems are usually applied to incoming data streams.

Information-retrieval systems must grapple with all of the ambiguities and idiosyncrasies inherent in natural language, such as synonymy (e.g., “start”, “begin”, and “initiate” have essentially the same meaning) and polysemy (e.g., “shot” has many different meanings, including the act of shooting, an injection, a quantity of liquor, a photograph, pellets, or an attempt). Phrases also require special attention because multiword expressions often have a composite meaning different from the individual words.
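
The effect of synonymy and phrases on word-level matching can be illustrated with a minimal Python sketch. It is not taken from any particular IR system: the synonym table, the function names, and the sample sentences are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, hand-built synonym table (an assumption for this sketch);
# a real system would need a far larger lexical resource.
SYNONYMS = {"start": {"start", "begin", "initiate"}}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def matches_word(query_word, text, expand=False):
    """Return True if the query word (or, optionally, a listed synonym) occurs in the text."""
    targets = SYNONYMS.get(query_word, {query_word}) if expand else {query_word}
    return any(tok in targets for tok in tokenize(text))

def matches_phrase(phrase, text):
    """Return True only if the phrase's words occur adjacently and in order."""
    words, toks = tokenize(phrase), tokenize(text)
    n = len(words)
    return any(toks[i:i + n] == words for i in range(len(toks) - n + 1))

doc = "We begin the experiment at noon."
print(matches_word("start", doc))               # False: exact word match misses the synonym
print(matches_word("start", doc, expand=True))  # True: synonym expansion finds "begin"

pet = "The dog lay panting in the hot sun."
print(all(matches_word(w, pet) for w in ["hot", "dog"]))  # True: both words occur somewhere
print(matches_phrase("hot dog", pet))                      # False: the words are not adjacent
```

In this sketch only the adjacency check treats the phrase as a unit; matching word by word accepts text in which the words are unrelated.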
For example, a “hot dog” does not usually refer to a warm canine, and an “operating system” does not usually refer to a system that is simply operating.

Most information-retrieval systems preprocess a document collection into an inverted file that allows the system to determine quickly which documents contain a given word. Stopword lists are commonly used to remove highly frequent words, such as “the” and “of,” under the assumption that they don’t contribute much to the meaning of a text. Stemming algorithms are sometimes used to reduce a word to its root form so that different morphological variations will match [Frakes and Baeza-Yates 1992]. An alternative text-representation scheme uses superimposed codewords to produce a fixed-length vector from the binary representations of words. The fixed-length vector is especially useful for parallel and hardware systems, but this method can sometimes produce false matches, reporting words that don’t actually appear in the original document.

Traditional information-retrieval methods retrieve documents by searching for relevant words or phrases. Most commercial IR systems allow the user to define a query using keywords and standard Boolean operators. These systems retrieve documents that precisely match the query. The vector-space model [Salton