Text structure recognition using a region algebra

We consider the problem of incrementally developing a parser for text structure, that is, building the parser specification a piece at a time while simultaneously developing our understanding of the text. We argue that existing solutions have usability and efficiency problems for this application, and we propose an alternative approach based on the type of region algebra that is often used as a query language for text databases. A region algebra is an appropriate interface for incremental development, but it has no efficient batch parsing model such as those that exist for grammars. In this thesis, we propose an efficient batch parsing model and characterize the region algebras to which it applies.
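To make the idea of a region algebra concrete, here is a minimal sketch in Python. It is not the algebra or the parsing model developed in the thesis. It assumes a region is a half-open character span and shows two containment operators of the kind commonly found in text-database query languages. The names `match`, `containing`, and `contained_in`, and the sample text, are hypothetical and chosen only for illustration.

```python
import re
from typing import List, Tuple

# Assumption: a region is a half-open character span [start, end).
Region = Tuple[int, int]


def match(text: str, pattern: str) -> List[Region]:
    """Return the regions of `text` matched by a regular expression."""
    return [m.span() for m in re.finditer(pattern, text)]


def _inside(inner: Region, outer: Region) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Regions of `outer` that contain at least one region of `inner`."""
    return [a for a in outer if any(_inside(b, a) for b in inner)]


def contained_in(inner: List[Region], outer: List[Region]) -> List[Region]:
    """Regions of `inner` that lie within at least one region of `outer`."""
    return [b for b in inner if any(_inside(b, a) for a in outer)]


# Hypothetical sample text: a few "key: value" lines.
text = "name: Ada\nage: 36\nname: Bob\n"

# Build up structure a piece at a time: first lines, then the lines
# that mention "name", then the value fields inside those lines.
lines = match(text, r"[^\n]+")
name_lines = containing(lines, match(text, r"name"))
values = contained_in(match(text, r"(?<=: )\w+"), name_lines)

print([text[s:e] for s, e in values])
```

Each intermediate result (`lines`, `name_lines`, `values`) can be inspected on its own, which is what makes this style of specification suit incremental development. The quadratic scans in `containing` and `contained_in` are a simplification for readability and say nothing about the efficient batch model the thesis proposes.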
