XTreePath: A generalization of XPath to handle real world structural variation

We address a key problem in information extraction: wrapper failures caused by changing content templates. A substantial proportion of wrapper failures occur when an HTML template changes, so that a wrapper becomes incompatible after elements are added to or removed from the DOM (the tree representation of HTML). We perform a large-scale empirical analysis of the causes of shift and mathematically quantify the levels of domain difficulty based on entropy. We propose the XTreePath annotation method to capture contextual node information from the training DOM. We then utilize this annotation in a supervised manner at test time with our proposed Recursive Tree Matching method, which recursively locates the nodes most similar in context using the tree edit distance. The search is guided by a heuristic function that takes into account how similar a candidate tree is to the structure that was present in the training data. We evaluate XTreePath using 117,422 pages from 75 diverse websites in 8 vertical markets. Our XTreePath method consistently outperforms XPath and a current commercial system in terms of successful extractions in a black-box test. We make our code and datasets publicly available online.
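To make the idea of matching by context concrete, the following is a minimal sketch and not the released XTreePath implementation. It assumes a simplified top-down (Selkow-style) ordered tree edit distance over tag names only, and the names `Node`, `tree_distance`, `best_match`, and `locate` are hypothetical. It stores the subtree around a target node from a training page, finds the closest subtree in a changed test page, and then descends toward the target by repeatedly choosing the most similar child.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag, optional text, ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


def size(n: Node) -> int:
    """Number of nodes in the subtree rooted at n."""
    return 1 + sum(size(c) for c in n.children)


def tree_distance(a: Node, b: Node) -> int:
    """Simplified top-down tree edit distance (an assumption, not the paper's
    exact measure): relabel costs 1, inserting or deleting a child costs the
    size of that child's whole subtree."""
    relabel = 0 if a.tag == b.tag else 1
    xs, ys = a.children, b.children
    # dp[i][j] = cost of aligning the first i children of a with the first j of b
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                      # delete subtree
                dp[i][j - 1] + size(ys[j - 1]),                      # insert subtree
                dp[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),  # match
            )
    return relabel + dp[-1][-1]


def best_match(pattern: Node, root: Node) -> Tuple[int, Node]:
    """Recursively search every subtree of root for the one closest to pattern."""
    best = (tree_distance(pattern, root), root)
    for child in root.children:
        candidate = best_match(pattern, child)
        if candidate[0] < best[0]:
            best = candidate
    return best


def locate(pattern: Node, path: List[int], root: Node) -> Optional[Node]:
    """Find the best-matching subtree, then follow `path` (child indexes inside
    the training pattern) by picking the most similar child at each step."""
    _, node = best_match(pattern, root)
    for idx in path:
        pattern = pattern.children[idx]
        if not node.children:
            return None
        node = min(node.children, key=lambda c: tree_distance(pattern, c))
    return node


# Training context: <div><h1/><span>price</span></div>, target is child index 1.
pattern = Node("div", [Node("h1"), Node("span")])

# Test page: an ad banner <div> was added and a <p> was inserted before the price.
test_dom = Node("body", [
    Node("div", [Node("img")]),
    Node("div", [Node("h1"), Node("p"), Node("span", text="$19.99")]),
])

distance, matched = best_match(pattern, test_dom)
target = locate(pattern, [1], test_dom)
print(distance, target.text if target else None)
```

In this toy page a positional XPath learned on the training layout, such as `/body/div[1]/span[1]`, would point into the newly added banner and find no `span`, whereas the context match still reaches the price node. The cost model and the greedy descent are illustrative choices only.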
