report discussed how to estimate the database usefulness defined in this paper for the high-correlation and disjoint scenarios. Such discussions did not appear in [18].) A collection of more than 6,000 real Internet queries is used. However, the database collection is small and there is no document relevance information. An ideal testbed should have a large collection of databases of various sizes, contents and structures, and a large collection of queries of various lengths, with the relevant documents for each query identified. A recently proposed testbed for evaluating metasearch techniques [14] is far from ideal. The above list of challenges is by no means complete. New problems will arise as our understanding of the issues in metasearch deepens.

8 Conclusions

With the increase in the number of search engines and digital libraries on the World Wide Web, providing easy, efficient and effective access to text information from multiple sources has become increasingly necessary. In this article, we presented an overview of existing metasearch techniques. Our overview concentrated on the problems of database selection, document selection and result merging. A wide variety of techniques for each of these problems were surveyed and compared. We also discussed the underlying causes that make these problems so challenging: the various heterogeneities among different local search engines due to their independent implementations, and the lack of information about these implementations because they are mostly proprietary. Our survey and investigation indicate that there may not be a single best solution for any of the main problems addressed in this article, namely the database selection problem, the document selection problem and the result merging problem.
Better solutions often require more information from local search engines, such as more detailed database representatives, the underlying similarity functions, term weighting schemes, indexing methods, and so on. There are currently no sufficiently efficient methods to find such information independently. A possible scenario is that we will need good solutions based on different degrees of knowledge about each local search engine and will then apply these solutions accordingly. Another important issue is the scalability of the solutions. Ultimately, we need to develop solutions that can scale in two orthogonal dimensions: data and access. Specifically, a good solution must scale to thousands of databases, many of them containing millions of documents, and to millions of accesses a day. None of the proposed solutions have …
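As a concrete illustration of the result-merging problem summarized above, the following sketch merges ranked lists from two hypothetical local search engines by min-max normalizing their locally incomparable similarity scores before combining them. The engine names, scores, and the choice of min-max normalization are illustrative assumptions for this sketch, not a technique prescribed by this survey.

```python
# Illustrative sketch: merging results from independent local search
# engines whose similarity scores use different, incomparable scales.
# Min-max normalization rescales each local list to [0, 1] first.

def min_max_normalize(results):
    """Rescale (doc_id, score) pairs so scores fall in [0, 1]."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all scores equal: treat them as ties at 1.0
        return [(d, 1.0) for d, _ in results]
    return [(d, (s - lo) / (hi - lo)) for d, s in results]

def merge(*result_lists):
    """Merge several ranked lists, sorted by normalized score (desc)."""
    combined = []
    for results in result_lists:
        combined.extend(min_max_normalize(results))
    return sorted(combined, key=lambda pair: pair[1], reverse=True)

# Hypothetical local result lists (note the different score scales).
engine_a = [("docA1", 950.0), ("docA2", 400.0), ("docA3", 100.0)]
engine_b = [("docB1", 0.90), ("docB2", 0.45)]

merged = merge(engine_a, engine_b)
```

In practice, as the survey notes, such normalization is complicated by not knowing each engine's similarity function; this sketch only shows why raw local scores cannot be compared directly.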
References

[1] Edward A. Fox, et al. Combination of Multiple Searches. TREC, 1993.
[2] Dayne Freitag, et al. A Machine Learning Architecture for Optimizing Web Search Engines. 1999.
[3] Abdulla Ghaleb, et al. Characterizing World Wide Web Queries. 1997.
[4] Christoph Baumgarten, et al. A probabilistic model for distributed information retrieval. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1997.
[5] W. Bruce Croft, et al. Searching distributed collections with inference networks. SIGIR '95, 1995.
[6] Susan T. Dumais, et al. Latent Semantic Indexing (LSI) and TREC-2. TREC, 1993.
[7] James P. Callan, et al. Automatic discovery of language models for text databases. SIGMOD '99, 1999.
[8] Anil S. Chakravarthy, et al. NetSerf: using semantic knowledge to find Internet information archives. SIGIR '95, 1995.
[9] Weiyi Meng, et al. Using the Structure of HTML Documents to Improve Retrieval. USENIX Symposium on Internet Technologies and Systems, 1997.
[10] Sergey Brin, et al. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Comput. Networks, 1998.
[11] James Allan, et al. Automatic Retrieval With Locality Information Using SMART. TREC, 1992.
[12] Adele E. Howe, et al. Experiences with selecting search engines using metasearch. TOIS, 1997.