Paper Information - Automatically Extracting Subsequent Response Pages from Web Search Sources

Usually, when Web search sources such as search engines and deep Web sites retrieve too many result records for a given query, they split the records across several pages, with, say, ten or twenty records per page, and return only the page holding the top-ranked records. This page usually provides one or more hyperlinks or buttons pointing to one or more of the remaining response pages (called subsequent response pages), which in turn contain similar hyperlinks or buttons that let users navigate from one page to another. Information integration systems often need to access these subsequent response pages to extract the records they contain. However, different Web search sources often display the hyperlinks or buttons pointing to subsequent response pages in different formats, which makes it challenging to identify these hyperlinks or buttons automatically and to extract the response pages they reference. In this paper, we propose a novel solution to automatically fetch any specified response page from autonomous and heterogeneous Web search sources for any given query. Our approach first identifies certain important hyperlinks present in the response page sampled from an input Web search source and then further analyzes them using four heuristics. Finally, a wrapper is built to automatically extract any specified response page from the input source.
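
To make the pipeline described above more concrete, here is a minimal Python sketch of the general idea: collect candidate hyperlinks from a sampled response page, then build a small wrapper that produces the URL of any requested page. The abstract does not spell out the four heuristics, so this sketch does not reproduce them. It substitutes one simple illustrative rule (numeric anchor text plus a single query parameter that changes linearly with the page number). The names `LinkCollector`, `numbered_page_links`, and `build_wrapper` are hypothetical, and button-based (form) navigation is not handled.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def numbered_page_links(html):
    """Return {page_number: href} for links whose anchor text is an integer."""
    parser = LinkCollector()
    parser.feed(html)
    return {int(text): href for href, text in parser.links if text.isdecimal()}


def build_wrapper(html):
    """Return a function mapping a page number to that page's URL.

    Illustrative assumption (not one of the paper's heuristics): exactly one
    query parameter differs between two numbered links, and its value grows
    linearly with the page number.
    """
    pages = numbered_page_links(html)
    if len(pages) < 2:
        raise ValueError("need at least two numbered page links")
    (p1, u1), (p2, u2) = sorted(pages.items())[:2]
    q1 = dict(parse_qsl(urlsplit(u1).query))
    q2 = dict(parse_qsl(urlsplit(u2).query))
    varying = [k for k in q1 if k in q2 and q1[k] != q2[k]]
    if len(varying) != 1:
        raise ValueError("could not isolate a single paging parameter")
    key = varying[0]
    if not (q1[key].isdecimal() and q2[key].isdecimal()):
        raise ValueError("paging parameter is not numeric")
    v1, v2 = int(q1[key]), int(q2[key])
    if (v2 - v1) % (p2 - p1) != 0:
        raise ValueError("paging parameter is not linear in the page number")
    step = (v2 - v1) // (p2 - p1)

    def page_url(n):
        # Extrapolate the paging parameter from the first observed link.
        query = dict(q1)
        query[key] = str(v1 + (n - p1) * step)
        return urlunsplit(urlsplit(u1)._replace(query=urlencode(query)))

    return page_url


sample = (
    '<a href="/search?q=web&amp;start=10">2</a> '
    '<a href="/search?q=web&amp;start=20">3</a> '
    '<a href="/about">About</a>'
)
page_url = build_wrapper(sample)
print(page_url(5))
```

A real source may page through form buttons, JavaScript, or session-dependent URLs, which is presumably where a richer set of heuristics such as the paper's becomes necessary. The sketch covers only the simplest case, plain hyperlinks with a numeric offset.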
