Semi-Supervised Web Wrapper Repair via Recursive Tree Matching

Continuous data extraction pipelines built on wrappers have become a common and integral part of businesses that deal with stock, flight, or product information. Extracting data from websites that use HTML templates is difficult because available wrapper methods are not designed to handle websites that change over time through the inclusion or removal of HTML elements. We are the first to perform a large-scale empirical analysis of the causes of this shift, and we propose the concept of domain entropy to quantify it. Drawing on this analysis, we propose a new semi-supervised search approach called XTPath. XTPath combines the existing XPath with carefully designed annotation extraction and informed search strategies. XTPath is the first method to store contextual node information from the training DOM and use it in a supervised manner. We use this data with our proposed recursive tree matching method, which locates the nodes most similar in context. The search is guided by a heuristic function that scores how similar a candidate tree is to the structure that was present in the training data. We systematically evaluate XTPath on 117,422 pages from 75 diverse websites in 8 vertical markets that cover vastly different topics. In a black-box test, XTPath consistently outperforms XPath and a current commercial system in terms of successful extractions. XTPath is the first supervised wrapper extraction method whose code and datasets are publicly available online.
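To make the idea of locating the node most similar in context more concrete, the following is a minimal Python sketch, not the XTPath implementation. It assumes the stored training context is a small tag tree and substitutes the classic Simple Tree Matching algorithm, normalized by tree size, for the heuristic similarity function. The names `Node`, `simple_tree_matching`, `similarity`, and `best_match` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """A simplified DOM node: only the tag name and ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest order-preserving top-down match between a and b.

    Assumption: this is the classic Simple Tree Matching recursion, used
    here as a stand-in for the similarity heuristic described in the text.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j] = best match using the first i children of a and j of b
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
    return table[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalize the match size to [0, 1]; 1.0 means identical tag trees."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def iter_nodes(root: Node) -> Iterator[Node]:
    """Yield every node of the tree in pre-order."""
    yield root
    for child in root.children:
        yield from iter_nodes(child)


def best_match(stored_context: Node, page_root: Node) -> Tuple[Node, float]:
    """Return the page node whose subtree best resembles the stored context."""
    best = max(iter_nodes(page_root), key=lambda n: similarity(stored_context, n))
    return best, similarity(stored_context, best)


# Hypothetical example: the template gained an <img> before the target fields.
stored = Node("div", [Node("h1"), Node("span")])
page = Node("body", [
    Node("div", [Node("p")]),
    Node("div", [Node("img"), Node("h1"), Node("span")]),
])
target, score = best_match(stored, page)
```

In this toy page, a positional XPath recorded on the training DOM could break once the extra `img` element appears, whereas the similarity search still selects the second `div` because its subtree shares the most structure with the stored context. A real system would presumably also compare attributes, text, and ancestor context, all of which this sketch omits.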
