Extracting Data Records from Query Result Pages Based on Visual Features

Web databases contain large amounts of structured data that are accessible only through their query interfaces. Query results are presented in dynamically generated web pages, usually as data records intended for human readers. Automatically extracting data records from query result pages is critical for web data integration applications such as comparison shopping sites and meta-search engines. A number of approaches to query result extraction have been proposed, but they start to fail as the structures of web pages become more complex. Query result pages also usually contain other types of information in addition to query results, e.g., advertisements and navigation bars. Most existing approaches do not remove such irrelevant content, which may reduce the accuracy of data record extraction. We have observed that query results are usually displayed in regular visual patterns and that terms used in a query often reappear in the query results. We propose a novel approach that uses visual features and query terms to identify the data section and extract data records from it. We also use several content and visual features of the visual blocks in a data section to filter out noisy blocks. Experiments on a large set of query result pages from different domains show that the proposed approach is highly effective.
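
The two observations in the abstract (regular visual layout and re-appearing query terms) can be illustrated with a minimal sketch. It is not the paper's algorithm: the `Block` model, the `find_data_section` function, the `tolerance` bucketing, and the hit-count score are all assumptions made for illustration. Blocks that share a left edge and width are grouped as one visual pattern, and the group with the most blocks containing a query term is taken as the data section.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered visual block (hypothetical model): position, size, visible text."""
    x: int
    y: int
    width: int
    height: int
    text: str


def find_data_section(blocks, query_terms, tolerance=5):
    """Return the group of similarly laid-out blocks that best matches the query.

    Assumption: blocks with the same left edge and width (bucketed by
    `tolerance` pixels) follow one regular visual pattern.
    """
    terms = [t.lower() for t in query_terms]

    # Group blocks by bucketed left edge and width.
    groups = defaultdict(list)
    for block in blocks:
        key = (block.x // tolerance, block.width // tolerance)
        groups[key].append(block)

    # A repeating pattern needs at least two blocks.
    candidates = [g for g in groups.values() if len(g) >= 2]
    if not candidates:
        return []

    def query_hits(group):
        # Count the blocks in the group that contain at least one query term.
        return sum(
            1 for b in group if any(t in b.text.lower() for t in terms)
        )

    best = max(candidates, key=query_hits)
    # Return the blocks in top-to-bottom reading order.
    return sorted(best, key=lambda b: b.y)
```

A real system would take block geometry from a rendering engine and would need a further step, such as the noisy-block filtering mentioned in the abstract, before treating each block as a data record.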
