Fully automatic wrapper generation for search engines

When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to, and/or a snippet of, a retrieved Web page. Such a result page often also contains information irrelevant to the query, such as advertisements and information about the site hosting the search engine. In this paper, we present a technique for automatically producing wrappers that extract search result records from the dynamically generated result pages returned by search engines. Automatic search result record extraction is important for many applications that need to interact with search engines, such as the automatic construction and maintenance of metasearch engines and deep Web crawling. The novel aspect of the proposed technique is that it uses both the visual content features of the result page as displayed in a browser and the tag structure of the page's HTML source. Experimental results indicate that this technique can achieve very high extraction accuracy.
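
To make the extraction task concrete, the following is a minimal sketch of the tag-structure half of the idea only. It is not the technique proposed in the paper: it ignores visual features entirely and relies on the simplifying assumption that result records share one repeated tag path, while navigation and advertisement links sit elsewhere in the tree. The names `LinkPathCollector`, `extract_record_links`, and `SAMPLE_PAGE` are hypothetical and introduced only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LinkPathCollector(HTMLParser):
    """Records, for every <a href=...>, the tag path of its parent element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.links = []   # (parent tag path, href) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def extract_record_links(html):
    """Return the hrefs whose parent tag path is the most frequent one.

    Assumption (illustrative only): the most frequently repeated link
    context on a result page corresponds to the search result records.
    """
    collector = LinkPathCollector()
    collector.feed(html)
    if not collector.links:
        return []
    counts = Counter(path for path, _ in collector.links)
    best_path, _ = counts.most_common(1)[0]
    return [href for path, href in collector.links if path == best_path]


# A made-up result page: one navigation link, three records, one advertisement.
SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<ul>
<li><a href="http://a.example/1">A</a><p>snippet</p></li>
<li><a href="http://b.example/2">B</a><p>snippet</p></li>
<li><a href="http://c.example/3">C</a><p>snippet</p></li>
</ul>
<div class="ad"><a href="http://ads.example">Ad</a></div>
</body></html>
"""

record_links = extract_record_links(SAMPLE_PAGE)
```

On this sample, the three links under `html/body/ul/li` outnumber the two under `html/body/div`, so only the record links are returned. A purely structural heuristic like this fails when boilerplate links are more numerous than records, which is one reason to combine tag structure with other evidence such as the rendered layout.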
