Crawling Deep Web Using a New Set Covering Algorithm

Crawling the deep web often requires selecting an appropriate set of queries that covers most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been studied extensively. Conventional set covering algorithms, however, do not work well when applied to deep web crawling because of several special features of this application domain. Most set covering algorithms assume that the elements being covered are uniformly distributed, whereas in deep web crawling neither the sizes of the documents nor the document frequencies of the queries are distributed uniformly; instead, they follow a power-law distribution. We have therefore developed a new set covering algorithm that targets deep web crawling. Compared with our previous deep web crawling method, which uses a straightforward greedy set covering algorithm, the new algorithm introduces weights into the greedy strategy. Our experiments on a variety of corpora show that the new method consistently outperforms its unweighted version.
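To make the idea of a weighted greedy strategy concrete, the following is a minimal sketch and not the authors' exact algorithm, since the abstract does not specify the weighting scheme. It assumes, purely for illustration, that each document is weighted by the inverse of the number of candidate queries that match it (so rarely matched documents count more) and that the cost of a query is the number of documents it returns. The function name `weighted_greedy_cover` and both of these choices are hypothetical.

```python
import math
from collections import defaultdict


def weighted_greedy_cover(queries, costs=None):
    """Pick queries until every document is covered.

    queries: dict mapping a query term to the set of document ids it matches.
    costs:   optional dict mapping a query term to its cost; by default the
             cost is the number of matched documents (an assumption).
    Returns the list of selected queries in selection order.
    """
    if costs is None:
        costs = {q: len(docs) for q, docs in queries.items()}

    # Count how many candidate queries match each document.
    matches = defaultdict(int)
    for docs in queries.values():
        for d in docs:
            matches[d] += 1

    # Assumed weighting: documents matched by few queries weigh more.
    weight = {d: 1.0 / n for d, n in matches.items()}

    uncovered = set(matches)
    chosen = []
    while uncovered:
        best, best_ratio = None, math.inf
        # Sorted iteration keeps tie-breaking deterministic.
        for q in sorted(queries):
            gain = sum(weight[d] for d in queries[q] & uncovered)
            if gain == 0:
                continue  # this query adds no new documents
            ratio = costs[q] / gain  # cost per unit of new weighted coverage
            if ratio < best_ratio:
                best, best_ratio = q, ratio
        chosen.append(best)
        uncovered -= queries[best]
    return chosen


# Toy example: three queries over three documents.
toy = {"x": {1, 2}, "y": {2, 3}, "z": {3}}
print(weighted_greedy_cover(toy))
```

Setting every weight to 1 reduces this sketch to the plain cost-per-new-document greedy heuristic, which corresponds to the unweighted baseline the abstract compares against.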
