Crawling Ranked Deep Web Data Sources

In the era of big data, the vast majority of data does not come from the surface web, the part of the web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, much of the valuable data resides in the deep web, the part of the web that is hidden behind query interfaces. Because deep web data are often of high value, a line of research on crawling deep web data sources has developed over the past decade. However, most existing crawling methods assume that all matched documents are returned. In practice, many data sources rank the matched documents and return only the top k matches. When conventional methods are applied to such ranked data sources, popular queries that match more than k documents cause large redundancy. This paper proposes a document frequency (df) based algorithm that exploits queries whose document frequencies fall within a specified range. The algorithm is extensively tested on a variety of datasets and compared with two existing algorithms. We demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
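The idea of restricting queries to a document frequency range can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the whitespace tokenizer, and the use of document id order as a stand-in for the source's ranking are all assumptions made for illustration, and document frequencies are computed here from a fully known toy corpus, whereas a real crawler would have to estimate them.

```python
from collections import defaultdict


def document_frequencies(docs):
    """Map each term to the set of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in set(text.lower().split()):
            postings[term].add(doc_id)
    return postings


def select_queries(postings, df_min, df_max):
    """Keep the terms whose document frequency lies in [df_min, df_max]."""
    return sorted(
        term for term, ids in postings.items() if df_min <= len(ids) <= df_max
    )


def crawl_top_k(postings, queries, k):
    """Simulate a ranked source that returns at most k matches per query.

    Returns the set of distinct documents harvested and the total number
    of results returned, so the two can be compared to gauge redundancy.
    """
    harvested = set()
    returned = 0
    for query in queries:
        # Assumption: ranking is approximated by ascending document id.
        results = sorted(postings[query])[:k]
        returned += len(results)
        harvested.update(results)
    return harvested, returned


if __name__ == "__main__":
    toy_docs = {
        1: "deep web crawl",
        2: "deep web data",
        3: "deep query rank",
        4: "web rank data",
    }
    postings = document_frequencies(toy_docs)
    queries = select_queries(postings, df_min=2, df_max=2)
    harvested, returned = crawl_top_k(postings, queries, k=2)
    print(queries, sorted(harvested), returned)
```

In this sketch, setting df_max no larger than k means that no selected query matches more documents than the source will return, which is the situation the abstract identifies as the cause of redundancy for popular queries.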
