Query-Based Sampling: Can we do Better than Random?

Many servers on the web offer content that is accessible only through a search interface. These servers are part of the deep web. Indexing their content with conventional crawling is impossible without some form of cooperation. Query-based sampling provides an alternative to crawling that requires no cooperation beyond a basic search interface. In this approach, random queries are conventionally sent to a server to obtain a sample of documents from the underlying collection. The sample serves as a representation of the entire server content, called a resource description. In this research we explore whether better resource descriptions can be obtained by using alternative query construction strategies. The results indicate that randomly choosing queries from the vocabulary of sampled documents is indeed a good strategy. However, we show that, when sampling a large collection, using the least frequent terms in the sample yields a better resource description than using randomly chosen terms.
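
The sampling loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `search` function is a hypothetical in-memory stand-in for a remote search interface, and the names `query_based_sample`, `pick_random`, `pick_least_frequent`, the `top_n` cutoff, and the seed term are all assumptions made for illustration. The two selection functions correspond to the two strategies compared in the abstract: a random term from the sampled vocabulary, and the least frequent term in the sample.

```python
import random
from collections import Counter


def search(collection, term, top_n=4):
    """Hypothetical search interface: ids of up to top_n documents containing term."""
    hits = [i for i, text in enumerate(collection) if term in text.split()]
    return hits[:top_n]


def pick_random(term_counts, used, rng):
    """Choose an unused term uniformly at random from the sampled vocabulary."""
    candidates = sorted(t for t in term_counts if t not in used)
    return rng.choice(candidates) if candidates else None


def pick_least_frequent(term_counts, used, rng):
    """Choose the unused term with the lowest frequency in the sample (ties: alphabetical)."""
    candidates = [t for t in term_counts if t not in used]
    if not candidates:
        return None
    return min(candidates, key=lambda t: (term_counts[t], t))


def query_based_sample(collection, seed_term, pick_term, max_queries=20, seed=0):
    """Send single-term queries and build a resource description from the results."""
    rng = random.Random(seed)
    sampled = set()          # ids of documents retrieved so far
    term_counts = Counter()  # term frequencies over sampled documents
    used = set()             # terms already sent as queries
    term = seed_term
    for _ in range(max_queries):
        if term is None:     # vocabulary exhausted
            break
        used.add(term)
        for doc_id in search(collection, term):
            if doc_id not in sampled:
                sampled.add(doc_id)
                term_counts.update(collection[doc_id].split())
        term = pick_term(term_counts, used, rng)
    return sampled, term_counts


# Toy collection, invented for illustration only.
collection = [
    "deep web search interface",
    "search engine index crawl",
    "random query sample documents",
    "resource description sample vocabulary",
    "web crawl documents index",
]

sample_lf, description_lf = query_based_sample(collection, "search", pick_least_frequent)
sample_rnd, description_rnd = query_based_sample(collection, "search", pick_random)
```

Here `description_lf` and `description_rnd` play the role of the resource description: term frequencies estimated from the sampled documents only. A toy collection this small says nothing about which strategy is better; the sketch shows only where the query construction strategy plugs into the loop.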
