Logic-based web information extraction

The Web wrapping problem, i.e., the problem of extracting structured information from HTML documents, is one of great practical importance. The information overload that Web users often experience attests to the lack of intelligent and encompassing Web services that provide high-quality, collected, and value-added information. The Web wrapping problem has been addressed by a significant amount of research work. Previous work can be classified into two categories, depending on whether the HTML input is regarded as a sequential character string (e.g., [34, 27, 24, 30, 23]) or as a pre-parsed document tree (for instance, [35, 25, 22, 29, 3, 2, 26]). The latter category of work thus assumes that systems may make use of an existing HTML parser as a front end.

[1]  Wolfgang Thomas,et al.  Languages, Automata, and Logic , 1997, Handbook of Formal Languages.

[2]  Jeffrey D. Ullman,et al.  A Query Translation Scheme for Rapid Implementation of Wrappers , 1995, DOOD.

[3]  James W. Thatcher,et al.  Generalized finite automata theory with an application to a decision problem of second-order logic , 1968, Mathematical systems theory.

[4]  Michel Minoux,et al.  LTUR: A Simplified Linear-Time Unit Resolution Algorithm for Horn Formulae and Computer Implementation , 1988, Inf. Process. Lett..

[5]  John Doner,et al.  Tree Acceptors and Some of Their Applications , 1970, J. Comput. Syst. Sci..

[6]  Calton Pu,et al.  XWRAP: an XML-enabled wrapper construction system for Web information sources , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[7]  Maurice Bruynooghe,et al.  Information Extraction from Web Documents Based on Local Unranked Tree Automaton Inference , 2003, IJCAI.

[8]  M. Jarke,et al.  Fundamentals of Data Warehouses , 2003, Springer Berlin Heidelberg.

[9]  Thomas Schwentick,et al.  Query automata over finite trees , 2002, Theor. Comput. Sci..

[10]  Bertram Ludäscher,et al.  Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective , 1998, Inf. Syst..

[11]  Neil Immerman,et al.  Relational Queries Computable in Polynomial Time , 1986, Inf. Control..

[12]  Craig A. Knoblock,et al.  A hierarchical approach to wrapper induction , 1999, AGENTS '99.

[13]  Georg Gottlob,et al.  Monadic queries over tree-structured data , 2002, Proceedings 17th Annual IEEE Symposium on Logic in Computer Science.

[14]  Georg Gottlob,et al.  Complexity and expressive power of logic programming , 1997, Proceedings of Computational Complexity. Twelfth Annual IEEE Conference.

[15]  Georg Gottlob,et al.  Efficient Algorithms for Processing XPath Queries , 2002, VLDB.

[16]  Haim Gaifman,et al.  Decidable optimization problems for database logic programs , 1988, STOC '88.

[17]  Christoph Koch,et al.  Query evaluation on compressed trees , 2003, 18th Annual IEEE Symposium of Logic in Computer Science, 2003. Proceedings..

[18]  Ludwig Staiger,et al.  Ω-languages , 1997 .

[19]  Michael R. Genesereth,et al.  Software agents , 1994, CACM.

[20]  Georg Gottlob,et al.  A Formal Comparison of Visual Web Wrapper Generators , 2003, SOFSEM.

[21]  E. Mark Gold,et al.  Language Identification in the Limit , 1967, Inf. Control..

[22]  Jeffrey D. Ullman,et al.  Principles of Database and Knowledge-Base Systems, Volume II , 1988, Principles of computer science series.

[23]  Georg Gottlob,et al.  Visual Web Information Extraction with Lixto , 2001, VLDB.

[24]  Christoph Koch,et al.  Efficient Processing of Expressive Node-Selecting Queries on XML Data in Secondary Storage: A Tree Automata-based Approach , 2003, VLDB.

[25]  Moshe Y. Vardi.  The complexity of relational query languages (Extended Abstract) , 1982, STOC '82.

[26]  Bruno Courcelle,et al.  Graph Rewriting: An Algebraic and Logic Approach , 1991, Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics.

[27]  Thomas Schwentick,et al.  Numerical document queries , 2003, PODS '03.

[28]  Alberto H. F. Laender,et al.  DEByE - Data Extraction By Example , 2002, Data Knowl. Eng..

[29]  Jeffrey D. Ullman.  Principles of database and knowledge-base systems , 1989 .

[30]  Jörg Flum,et al.  Query evaluation via tree-decompositions , 2001, JACM.

[31]  Frank Neven,et al.  Expressiveness of structured document query languages based on attribute grammars , 2002, J. ACM.

[32]  Georg Gottlob,et al.  Conjunctive queries over trees , 2004, JACM.

[33]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[34]  Arnaud Sahuguet,et al.  Building intelligent Web applications using lightweight wrappers , 2001, Data Knowl. Eng..

[35]  Serge Abiteboul,et al.  Foundations of Databases , 1994 .

[36]  Georg Gottlob,et al.  Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto , 2001, LPNMR.

[37]  Harry G. Mairson,et al.  Undecidable optimization problems for database logic programs , 1993, JACM.

[38]  Frank Neven,et al.  Automata theory for XML researchers , 2002, SGMD.

[39]  Oded Shmueli,et al.  Decidability and expressiveness aspects of logic queries , 1987, XP7.52 Workshop on Database Theory.