PAT expressions: an algebra for text search

Text search operations locate and retrieve information from a text collection. In traditional information retrieval, text search is a means of identifying relevant documents [Salton83, Lee85]: by specifying selection criteria for the text of a document, the reader chooses a subset of a given set of documents. If the text collection is defined not as a set of documents but, more generally, as a structure containing parts, then text search involves specifying the parts that are of interest to the reader. The structure of the documents may be determined by the search system, by the author, by the text installer, or by the reader. In the PAT system [Gonnet87a, Fawcett89a, Fawcett89b], text search operations are expressions that efficiently combine traditional search capabilities with new, powerful facilities. PAT provides lexical search, proximity search, contextual search, and Boolean search [Hollaar79, Larson84, Lee85, Burkowski91]. It also provides less common operation types, including position search and frequency search. A novel feature of PAT is that a reader can define structures for a text and use these structures in subsequent operations. One of the goals of this paper is to introduce the search capabilities of PAT expressions.

[1] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[2] Gerard Salton, et al. Automatic text processing, 1988.

[3] Gerard Salton, et al. A blueprint for automatic indexing, 1981, SIGF.

[4] Gaston H. Gonnet, et al. Unstructured data bases or very efficient text searching, 1983, PODS.

[5] Darrell R. Raymond, et al. Flexible text display with Lector, 1992, Computer.

[6] Dik Lun Lee. The design and evaluation of a text retrieval machine for large databases, 1985.

[7] Jean Tague, et al. A Complete Model for Information Retrieval Systems, 1991, SIGIR 1991.

[8] Christos Faloutsos, et al. Signature files: design and performance comparison of some signature extraction methods, 1985, SIGMOD Conference.

[9] Per-Åke Larson, et al. A Method for Speeding Up Text Retrieval, 1983, Databases for Business and Office Applications.

[10] Gaston H. Gonnet, et al. Lexicographical Indices for Text: Inverted files vs. PAT trees, 1991.

[11] Lee A. Hollaar, et al. Text Retrieval Computers, 1979, Computer.

[12] Antonio Zampolli, et al. Computational lexicology and lexicography: special issue dedicated to Bernard Quemada, 1990.

[13] Darrell R. Raymond, et al. Playing detective with full text searching software, 1990, SIGDOC '90.

[14] Forbes J. Burkowski. Textriever: a retrieval engine for multimedia databases, 1991.

[15] Heather Fawcett, et al. The "New Oxford English Dictionary" Project, 1993.

[16] Gaston H. Gonnet, et al. Mind Your Grammar: a New Approach to Modelling Text, 1987, VLDB.

[17] Charles F. Goldfarb, et al. SGML handbook, 1990.

[18] Frank Wm. Tompa. An Overview of Waterloo's Database Software for the OED [1992, rptd. 1996, 2008], 1996.