Trigrams as index element in full text retrieval: observations and experimental results

A trigram is a three element sequence of characters. In this paper we demonstrate the effectiveness of a trigram based index for morphologically based retrievals from a full text document retrieval system. Retrieved documents are considered relevant if they contain exact matches for each of the query terms. Using this definition of relevance we consistently achieve a recall rate of 100%. In the experiments described here, we used sets of 100 anded three term queries, and the average precision per set varied from 47% to 87%. We propose a method for increasing the average precision to 100%. Using overlapping trigrams extracted from the Brown Corpus [KUCE67] and a character set of 45 elements, we found a horizontal asymptote near 11,000 for the number of entries in a trigram based index. Finally we show that a trigram based system provides a reasonable alternative to a word based one and is superior to it in retrievals of word fragments.