Algebras for querying text regions (extended abstract)

There is significant interest in combining and extending database and information retrieval technologies to manage textual data. The challenge is becoming more relevant with the increased availability of documents in digital form. Document data has a natural hierarchical structure, which may be made explicit through markup conventions (as is the case with SGML). An important aspect of managing structured and semi-structured textual data is supporting the efficient retrieval of text components based on both their content and their structure. In this paper we study issues related to the expressive power and optimization of a class of algebras that combine string (or pattern) searches with queries on the hierarchical structure of the text. The region algebra studied is a set-at-a-time algebra for manipulating text regions (substrings of the text) that supports queries on the nesting and ordering properties of those regions. The region algebra is part of the language used in commercial text retrieval systems, and it can be implemented very efficiently. The results in this work are obtained by showing a close relationship between the region algebra and the monadic first-order theory of binary trees. We show that queries in the algebra can be optimized, but that optimization can be difficult (co-NP-hard in the general case, although there is an important class of queries that can be optimized in polynomial time). On the negative side, we show that the language cannot capture some important properties of the text structure related to the nesting and ordering of text regions. We conclude by suggesting possible extensions.

*Research done while the author was at the University of Toronto.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
PODS '95, San Jose, CA, USA. © 1995 ACM 0-89791-730-8/95/0005...$3.50
Tova Milo*, Department of Computer Science
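The abstract describes the region algebra only in words: a set-at-a-time algebra over text regions with operators that test nesting and ordering. The following is a minimal sketch of that idea, not the paper's formal definition. The operator names (`including`, `included_in`, `before`, `after`), the strict-containment convention, the half-open `(start, end)` representation, and the helpers `match` and `tagged` are all assumptions made for illustration; `tagged` in particular only handles elements that do not nest inside themselves.

```python
import re
from typing import FrozenSet, Tuple

# A region is a substring of the text, given as half-open offsets (start, end).
Region = Tuple[int, int]
RegionSet = FrozenSet[Region]


def match(text: str, pattern: str) -> RegionSet:
    """Hypothetical helper: all occurrences of a literal pattern, as regions."""
    found = set()
    i = text.find(pattern)
    while i != -1:
        found.add((i, i + len(pattern)))
        i = text.find(pattern, i + 1)
    return frozenset(found)


def tagged(text: str, tag: str) -> RegionSet:
    """Hypothetical helper: regions of <tag>...</tag> elements.

    Assumes the element never nests inside itself.
    """
    pat = re.compile(r"<%s>.*?</%s>" % (re.escape(tag), re.escape(tag)), re.S)
    return frozenset((m.start(), m.end()) for m in pat.finditer(text))


def _contains(r: Region, s: Region) -> bool:
    # Assumed convention: strict containment (r encloses s and r != s).
    return r[0] <= s[0] and s[1] <= r[1] and r != s


def including(rs: RegionSet, ss: RegionSet) -> RegionSet:
    """Regions of rs that contain at least one region of ss."""
    return frozenset(r for r in rs if any(_contains(r, s) for s in ss))


def included_in(rs: RegionSet, ss: RegionSet) -> RegionSet:
    """Regions of rs that lie inside at least one region of ss."""
    return frozenset(r for r in rs if any(_contains(s, r) for s in ss))


def before(rs: RegionSet, ss: RegionSet) -> RegionSet:
    """Regions of rs that end no later than some region of ss starts."""
    return frozenset(r for r in rs if any(r[1] <= s[0] for s in ss))


def after(rs: RegionSet, ss: RegionSet) -> RegionSet:
    """Regions of rs that start no earlier than some region of ss ends."""
    return frozenset(r for r in rs if any(r[0] >= s[1] for s in ss))


text = (
    "<sec><title>trees</title><p>logic</p></sec>"
    "<sec><title>games</title><p>trees</p></sec>"
)

# Query: sections whose title contains the word "trees".
titles_with_trees = including(tagged(text, "title"), match(text, "trees"))
result = including(tagged(text, "sec"), titles_with_trees)
```

In this sketch every operator takes whole sets of regions and returns a subset of its first argument, which is what makes the style set-at-a-time: the query composes a pattern search (`match`) with two structural tests without iterating over individual regions in user code. Only the first section is selected, because in the second section the word "trees" occurs in a paragraph rather than in the title.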
