DPro: A Probabilistic Approach for Hidden Web Database Selection Using Dynamic Probing

An ever-increasing amount of valuable information is stored in Web databases, “hidden” behind search interfaces. To save the user the effort of manually exploring each database, metasearchers automatically select the databases most relevant to a user’s query [2, 5, 16, 21, 26]. Existing methods use a pre-collected summary of each database to estimate its “relevancy” to the query, and return the databases with the highest estimates. While this is a great starting point, the existing methods suffer from two drawbacks. First, because the estimates can be inaccurate, the returned databases are often wrong. Second, the system does not try to improve the “quality” of its answer by contacting some databases on the fly (to collect more information about them and select databases more accurately), even if the user is willing to wait for some time to obtain a better answer. In this paper, we introduce the notion of dynamic probing and study its effectiveness under a probabilistic framework. Under our framework, a user can specify how “correct” the selected databases should be, and our system automatically contacts a few databases to satisfy the user-specified correctness. Our experiments on 20 real hidden Web databases indicate that our approach significantly improves the correctness of the returned databases at the cost of a small number of database probes.
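To make the idea of probing until a user-specified correctness is reached more concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes that each database's relevancy estimate is given as a small discrete distribution, that a probe returns the database's true relevancy, and that the next database to probe is chosen greedily. The names `prob_best`, `select_with_probing`, and `probe` are hypothetical and introduced only for illustration.

```python
from itertools import product


def prob_best(dists):
    """Probability that each database has the highest relevancy.

    dists: list of dicts mapping a possible relevancy value to its
    probability (an assumed representation, not taken from the paper).
    Ties are split evenly among the tied databases.
    """
    probs = [0.0] * len(dists)
    # Enumerate every joint outcome; feasible only for small examples.
    for combo in product(*(d.items() for d in dists)):
        weight = 1.0
        for _, p in combo:
            weight *= p
        values = [v for v, _ in combo]
        top = max(values)
        winners = [i for i, v in enumerate(values) if v == top]
        for i in winners:
            probs[i] += weight / len(winners)
    return probs


def select_with_probing(dists, probe, threshold):
    """Pick one database, probing until the pick is 'correct enough'.

    probe(i) is a hypothetical callback returning the true relevancy of
    database i. Returns (selected index, confidence, probed indices).
    """
    dists = [dict(d) for d in dists]  # do not modify the caller's data
    probed = set()
    while True:
        probs = prob_best(dists)
        best = max(range(len(dists)), key=lambda i: probs[i])
        if probs[best] >= threshold or len(probed) == len(dists):
            return best, probs[best], sorted(probed)
        # Assumed greedy policy: probe the unprobed database that is
        # currently most likely to be the best one.
        target = max(
            (i for i in range(len(dists)) if i not in probed),
            key=lambda i: probs[i],
        )
        dists[target] = {probe(target): 1.0}  # estimate becomes certain
        probed.add(target)
```

Calling `select_with_probing` with a threshold close to 1 trades more probes for a more reliable answer, while a low threshold returns the summary-based choice without contacting any database. Exact enumeration is used here only to keep the sketch short; it grows exponentially with the number of databases.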

[1] Jennifer Widom, et al. The TSIMMIS Project: Integration of Heterogeneous Information Sources, 1994, IPSJ.

[2] King-Lup Liu, et al. Efficient and effective metasearch for text databases incorporating linkages among documents, 2001, SIGMOD '01.

[3] James P. Callan, et al. Automatic discovery of language models for text databases, 1999, SIGMOD '99.

[4] Paul Pedley. The invisible Web, 2001.

[5] B. Huberman, et al. The Deep Web: Surfacing Hidden Value, 2000.

[6] Christoph Baumgarten, et al. A probabilistic solution to the selection and fusion problem in distributed information retrieval, 1999, SIGIR '99.

[7] Ronald Fagin, et al. Combining Fuzzy Information from Multiple Systems, 1999, J. Comput. Syst. Sci.

[8] David Mason. The Invisible Web: Searching the Hidden Parts of the Web, 2002.

[9] Luis Gravano, et al. GlOSS: text-source discovery over the Internet, 1999, TODS.

[10] King-Lup Liu, et al. Determining Text Databases to Search in the Internet, 1998, VLDB.

[11] Chad Carson, et al. Optimizing queries over multimedia repositories, 1996, SIGMOD '96.

[12] Peter Bailey, et al. Server selection on the World Wide Web, 2000, DL '00.

[13] Alon Y. Halevy, et al. Answering queries using views: A survey, 2001, The VLDB Journal.

[14] Luis Gravano, et al. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection, 2002, VLDB.

[15] W. Bruce Croft, et al. Searching distributed collections with inference networks, 1995, SIGIR '95.

[16] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[17] Dik Lun Lee, et al. Server Ranking for Distributed Text Retrieval Systems on the Internet, 1997, DASFAA.

[18] Néstor J. Rodríguez, et al. Guidelines for designing usable World Wide Web pages, 1996, CHI Conference Companion.

[19] James P. Callan, et al. Effective retrieval with distributed collections, 1998, SIGIR '98.

[20] Seung-won Hwang, et al. Minimal probing: supporting expensive predicates for top-k queries, 2002, SIGMOD '02.

[21] The Effectiveness of GlOSS for the Text Database Discovery Problem, 1994, SIGMOD Conference.

[22] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.

[23] Luis Gravano, et al. Evaluating Top-k Selection Queries, 1999, VLDB.

[24] Luis Gravano, et al. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies, 1995, VLDB.