An Algebra for Structured Text Search and a Framework for its Implementation

A query algebra is presented that expresses searches on structured text. In addition to traditional full-text boolean queries that search a pre-defined collection of documents, the algebra permits queries that harness document structure. The algebra manipulates arbitrary intervals of text, which are recognized in the text from implicit or explicit markup. The algebra has seven operators, which combine intervals to yield new ones: containing, not containing, contained in, not contained in, one of, both of, and followed by. The ultimate result of a query is the set of intervals that satisfy it. An implementation framework is given based on four primitive access functions. Each access function finds the solution to a query nearest to a given position in the database. Recursive definitions for the seven operators are given in terms of these access functions. Search time is at worst proportional to the time required to evaluate the access functions for occurrences of the elementary terms in a query.
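To make the idea of operators over text intervals concrete, the following is a minimal sketch of the four containment operators named above. It is an illustration under stated assumptions, not the paper's implementation: intervals are assumed to be inclusive `(start, end)` pairs of word positions, the function names are hypothetical, and the naive nested scan stands in for the access-function framework, which the abstract does not spell out. The three combining operators (one of, both of, followed by) are omitted because their exact semantics are not given here.

```python
from typing import List, Tuple

# Assumption: an interval is an inclusive (start, end) pair of word positions.
Interval = Tuple[int, int]


def covers(outer: Interval, inner: Interval) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def containing(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intervals of `a` that contain at least one interval of `b`."""
    return [x for x in a if any(covers(x, y) for y in b)]


def not_containing(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intervals of `a` that contain no interval of `b`."""
    return [x for x in a if not any(covers(x, y) for y in b)]


def contained_in(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intervals of `a` that lie inside at least one interval of `b`."""
    return [x for x in a if any(covers(y, x) for y in b)]


def not_contained_in(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intervals of `a` that lie inside no interval of `b`."""
    return [x for x in a if not any(covers(y, x) for y in b)]


# Hypothetical data: two "section" intervals and two single-word term hits.
sections = [(0, 9), (10, 19)]
hits = [(3, 3), (25, 25)]

sections_with_hit = containing(sections, hits)
sections_without_hit = not_containing(sections, hits)
hits_in_section = contained_in(hits, sections)
hits_outside_section = not_contained_in(hits, sections)
```

With this data, only the first section contains a hit, and only the hit at position 3 falls inside a section. The quadratic scan is chosen for readability; it makes no attempt to match the search-time bound stated in the abstract.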
