SXPath: a Spatial Extension of XPath

We report on SXPath, a recently introduced extension of XPath for querying Web documents that considers both their tree structure and the spatial relationships between laid-out elements. The underlying rationale is that the tree structure behind a rendered page is frequently very involved and is updated more often than the resulting layout. In this paper, we present the syntax and semantics of the language, which are based on a combination of a spatial algebra with formal descriptions of XPath navigation. The language is intuitive and general enough to capture the most frequent extraction patterns. Moreover, we show that it maintains polynomial-time combined complexity. Practical experiments demonstrate the usability of SXPath. This work is a short version of [11].
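To make the idea of combining tree navigation with spatial relationships concrete, the following is a minimal sketch, not SXPath syntax and not the authors' semantics. It assumes each node carries an axis-aligned bounding box from rendering; the names `Node`, `is_east_of`, and `east_of` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """A document node with a rendered bounding box (assumed layout model)."""
    tag: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    children: List["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Tree navigation: yield all descendants in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def is_east_of(a: Node, b: Node) -> bool:
    """Spatial test: a starts at or right of b's right edge, with vertical overlap."""
    ax0, ay0, ax1, ay1 = a.box
    bx0, by0, bx1, by1 = b.box
    return ax0 >= bx1 and ay0 < by1 and by0 < ay1


def east_of(root: Node, context: Node) -> List[Node]:
    """Select nodes by layout position relative to a context node, ignoring tree depth."""
    return [n for n in descendants(root)
            if n is not context and is_east_of(n, context)]


# Hypothetical page: a label, a value rendered to its right, and a footer below.
label = Node("label", (0, 0, 50, 20))
value = Node("value", (60, 0, 120, 20))
footer = Node("footer", (0, 100, 120, 120))
# The value is nested deeper than the label, yet is adjacent in the layout.
page = Node("page", (0, 0, 120, 120),
            [label, Node("wrapper", (0, 0, 120, 120), [value]), footer])

selected = [n.tag for n in east_of(page, label)]
```

In this sketch the selection depends only on rendered positions, so it stays the same if the value is moved to a different depth in the tree, which is the kind of robustness the rationale above appeals to.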

[1] Jayant Madhavan, et al. Web-Scale Data Integration: You can afford to Pay as You Go, 2007, CIDR.

[2] Georg Gottlob, et al. Efficient Algorithms for Processing XPath Queries, 2002, VLDB.

[3] Steffen Staab, et al. SXPath - Extending XPath towards Spatial Querying on Web Documents, 2010, Proc. VLDB Endow.

[4] Jochen Renz, et al. Qualitative Spatial Reasoning with Topological Information, 2002, Lecture Notes in Computer Science.

[5] Jun Kong, et al. Spatial graph grammars for graphical user interfaces, 2006, TCHI.

[6] Bing Liu, et al. Structured Data Extraction from the Web Based on Partial Tree Alignment, 2006, IEEE Transactions on Knowledge and Data Engineering.

[7] Steven J. DeRose, et al. XML Path Language (XPath) Version 1.0, 1999.

[8] Khaled Shaalan, et al. A Survey of Web Information Extraction Systems, 2006, IEEE Transactions on Knowledge and Data Engineering.

[9] Luis Fariñas del Cerro, et al. A New Tractable Subclass of the Rectangle Algebra, 1999, IJCAI.

[10] Gultekin Özsoyoglu, et al. Querying Multimedia Presentations Based on Content, 1999, IEEE Trans. Knowl. Data Eng.

[11] Arnaud Sahuguet, et al. Building intelligent Web applications using lightweight wrappers, 2001, Data Knowl. Eng.

[12] Guido Sciavicco, et al. Spatial Reasoning with Rectangular Cardinal Direction Relations, 2006.

[13] Georg Gottlob, et al. The Lixto data extraction project: back and forth between theory and practice, 2004, PODS.

[14] Dimitrios Skoutas, et al. STAVIES: a system for information extraction from unknown Web data sources through automatic Web wrapper generation using clustering techniques, 2005, IEEE Transactions on Knowledge and Data Engineering.

[15] V. S. Subrahmanian, et al. An algebra for creating and querying multimedia presentations, 2000, Multimedia Systems.

[16] James F. Allen. Maintaining knowledge about temporal intervals, 1983, CACM.