ViWER: Data extraction for search engine results pages using visual cues and the DOM tree

Visual wrappers use visual information in addition to DOM tree properties to extract data records. The key feature of a visually assisted wrapper is its use of the bounding boxes of HTML tags to detect the relevant data region, which contains the required data records. However, a closer look indicates that additional visual cues, such as the size of a bounding box, can also be used to check the similarity of data records. In this paper, we present the two main features of our data extraction algorithm. First, we develop a tree matching algorithm to check the similarity of data records, which simplifies the complicated process of full tree matching. Second, we use the size of the bounding box to further improve the similarity check. Our study shows that using the size of text and images in wrapper design can improve the accuracy of detecting the correct data region (the search results output on search engine results pages). Results show that our wrapper is highly effective in data extraction.
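
To illustrate the idea of combining a simplified tree comparison with a bounding-box size check, the following is a minimal sketch and not the ViWER algorithm itself. The `Node` class, the position-wise matching rule, the area-ratio measure, and both thresholds are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical DOM node: a tag name plus the size of its rendered bounding box."""
    tag: str
    width: float
    height: float
    children: List["Node"] = field(default_factory=list)


def count_nodes(node: Node) -> int:
    """Total number of nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children)


def matched_nodes(a: Node, b: Node) -> int:
    """Simplified top-down match: children are compared position by position,
    and a subtree stops contributing as soon as its root tags differ."""
    if a.tag != b.tag:
        return 0
    return 1 + sum(matched_nodes(x, y) for x, y in zip(a.children, b.children))


def structure_similarity(a: Node, b: Node) -> float:
    """Share of nodes matched across both trees, in the range [0, 1]."""
    return 2 * matched_nodes(a, b) / (count_nodes(a) + count_nodes(b))


def size_similarity(a: Node, b: Node) -> float:
    """Ratio of the smaller bounding-box area to the larger one, in [0, 1]."""
    area_a = a.width * a.height
    area_b = b.width * b.height
    larger = max(area_a, area_b)
    if larger == 0:
        return 1.0  # two empty boxes are treated as the same size
    return min(area_a, area_b) / larger


def records_similar(a: Node, b: Node,
                    structure_threshold: float = 0.6,
                    size_threshold: float = 0.5) -> bool:
    """Two records count as similar only if both checks pass.
    The threshold values are arbitrary illustrative assumptions."""
    return (structure_similarity(a, b) >= structure_threshold
            and size_similarity(a, b) >= size_threshold)


# Two search-result-like records and one differently shaped block.
result_1 = Node("div", 600, 80, [Node("a", 300, 20), Node("span", 500, 40)])
result_2 = Node("div", 600, 90, [Node("a", 280, 20), Node("span", 520, 50)])
banner = Node("div", 600, 300, [Node("img", 600, 300)])

print(records_similar(result_1, result_2))
print(records_similar(result_1, banner))
```

In this sketch the size check acts as a second filter on top of the structural check, so a block with a similar tag layout but a very different rendered size would still be rejected.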