SXPath - Extending XPath towards Spatial Querying on Web Documents

Querying data in presentation formats such as HTML, for purposes such as information extraction, requires considering both the tree structure and the spatial relationships between laid-out elements. The rationale is that the tree structure behind a rendered page is frequently very involved and is updated more often than the resulting layout. In this paper, we therefore present Spatial XPath (SXPath), an extension of XPath 1.0 that adds spatial navigation primitives to the language, resulting in conceptually simpler queries on Web documents. The SXPath language combines a spatial algebra with formal descriptions of XPath navigation and maintains polynomial-time combined complexity. Practical experiments demonstrate the usability of SXPath.
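
To make the idea of a spatial navigation primitive concrete, the following is a minimal sketch in Python, not SXPath syntax or its formal semantics. It assumes each rendered element has an axis-aligned bounding box and selects the elements that lie to the west of a context element. The names `Box`, `west_of`, and `spatial_axis`, as well as the sample coordinates, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a rendered element (hypothetical layout data)."""
    name: str
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge


def west_of(a: Box, b: Box) -> bool:
    """True if a lies entirely to the left of b and their vertical extents overlap."""
    return a.x2 <= b.x1 and a.y1 < b.y2 and b.y1 < a.y2


def spatial_axis(context: Box, boxes: List[Box],
                 relation: Callable[[Box, Box], bool]) -> List[Box]:
    """Return every box (other than the context) that stands in `relation` to the context."""
    return [b for b in boxes if b is not context and relation(b, context)]


# Assumed sample layout: a label and a price on one visual row, a footer below.
boxes = [
    Box("label", 0, 0, 50, 20),
    Box("price", 60, 0, 120, 20),
    Box("footer", 0, 100, 120, 120),
]
price = boxes[1]
result = [b.name for b in spatial_axis(price, boxes, west_of)]
print(result)
```

The selection depends only on where the elements are laid out, not on how they are nested in the document tree, which is the property the abstract appeals to when it notes that layout changes less often than tree structure.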
