Web Data Extraction Based on Tree Structure Analysis and Template Generation

This paper studies the problem of extracting data from large numbers of semi-structured web pages. Many websites serve an enormous number of pages generated dynamically from an underlying structured source such as a database, which makes it feasible to induce a common template for similar web pages and then extract data accordingly. Previous work on this problem has limited practical utility because it either requires significant human effort or rests on several brittle assumptions. We propose a three-step approach consisting of template generation, template detection, and data extraction, with only a little human intervention for template editing. The core algorithm is based on two highly efficient tree structure analysis techniques. Experimental results show that our approach can extract web data with high accuracy and flexibility.
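
To make the idea of template induction concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm proposed in this paper. It assumes that the sample pages are well formed, contain no void elements such as `<br>`, and share exactly the same tag structure. Text that differs between two sample pages is treated as a data slot, and a page that does not match the template is rejected, which stands in for template detection. All names (`TreeBuilder`, `parse`, `induce`, `extract`, `SLOT`) are hypothetical.

```python
from html.parser import HTMLParser

SLOT = object()  # sentinel: a text position whose content varies across pages


class TreeBuilder(HTMLParser):
    """Builds a nested [tag, children] tree; text nodes are plain strings."""

    def __init__(self):
        super().__init__()
        self.root = ["root", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: tags are properly nested, so the top of the stack closes.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1][1].append(text)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def induce(a, b):
    """Merge two page trees of identical shape into a template tree."""
    if isinstance(a, str) and isinstance(b, str):
        return a if a == b else SLOT  # differing text becomes a data slot
    if isinstance(a, str) or isinstance(b, str):
        raise ValueError("structure mismatch")
    if a[0] != b[0] or len(a[1]) != len(b[1]):
        raise ValueError("structure mismatch")
    return [a[0], [induce(x, y) for x, y in zip(a[1], b[1])]]


def extract(template, page):
    """Return the text found at slot positions, in document order.

    Raises ValueError when the page does not match the template.
    """
    if template is SLOT:
        if not isinstance(page, str):
            raise ValueError("expected text at slot")
        return [page]
    if isinstance(template, str):
        if template != page:
            raise ValueError("fixed text differs")
        return []
    if (isinstance(page, str) or template[0] != page[0]
            or len(template[1]) != len(page[1])):
        raise ValueError("structure mismatch")
    values = []
    for t, p in zip(template[1], page[1]):
        values.extend(extract(t, p))
    return values


# Two sample pages induce the template; a third page is then extracted.
page_a = "<div><h1>Shop</h1><p>Widget</p><b>10</b></div>"
page_b = "<div><h1>Shop</h1><p>Gadget</p><b>25</b></div>"
page_c = "<div><h1>Shop</h1><p>Gizmo</p><b>7</b></div>"

template = induce(parse(page_a), parse(page_b))
records = extract(template, parse(page_c))
print(records)  # prints ['Gizmo', '7']
```

The sketch requires identical tree shapes, which is exactly the kind of brittle assumption a practical system has to relax, for example to handle optional or repeated elements.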