An Approach to Deep Web Crawling by Sampling

Crawling the deep web is the process of collecting data from search interfaces by issuing queries. With the wide availability of programmable interfaces exposed as Web services, deep web crawling has found a wide variety of applications. One of the major challenges in crawling the deep web is selecting queries so that most of the data can be retrieved at a low cost. We propose a general method for this task. To minimize the number of duplicates retrieved, we reduce the problem of selecting an optimal set of queries from a sample of the data source to the well-known set-covering problem and adopt a classical algorithm to solve it. To verify that queries selected from a sample also produce good results on the entire data source, we carried out a set of experiments on large corpora, including Wikipedia and Reuters. We show empirically that our sampling-based method is effective: 1) the queries selected from a sample can harvest most of the data in the original database; 2) queries with a low overlapping rate in the sample also yield a low overlapping rate in the original database; and 3) neither the sample nor the pool of terms from which queries are selected needs to be very large.
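The reduction to set covering can be illustrated with a minimal sketch. It assumes that each candidate query is a single term, that a query "covers" the sample documents containing that term, and that the greedy heuristic stands in for the classical algorithm, which the abstract does not name. The function names, the toy sample, and the definition of overlapping rate used here (total documents returned divided by distinct documents returned) are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def build_index(sample_docs):
    """Map each term to the set of sample document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in sample_docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def greedy_query_selection(index, universe):
    """Greedy set cover: repeatedly pick the term that matches the most
    still-uncovered documents, until every document is covered."""
    uncovered = set(universe)
    selected = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(index), key=lambda t: len(index[t] & uncovered))
        gain = index[best] & uncovered
        if not gain:
            break  # remaining documents match no candidate term
        selected.append(best)
        uncovered -= gain
    return selected


def overlapping_rate(index, queries):
    """Assumed definition: documents returned in total (duplicates counted)
    divided by distinct documents returned. 1.0 means no duplicates."""
    retrieved = sum(len(index[q]) for q in queries)
    unique = len(set().union(*(index[q] for q in queries)))
    return retrieved / unique if unique else 0.0


# Hypothetical four-document sample standing in for a sample of the data source.
sample_docs = {
    1: "deep web crawling",
    2: "web services interface",
    3: "set covering problem",
    4: "query selection problem",
}

index = build_index(sample_docs)
queries = greedy_query_selection(index, sample_docs.keys())
print(queries, overlapping_rate(index, queries))
```

On this toy sample two queries cover all four documents without returning any document twice. In the setting the abstract describes, the selected queries would then be issued against the full data source, and the coverage and overlapping rate measured there would be compared with the values observed on the sample.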
