A highly scalable and effective method for metasearch

A metasearch engine is a system that supports unified access to multiple local search engines. Database selection is one of the main challenges in building a large-scale metasearch engine: for each user query, the system must efficiently and accurately identify a small number of potentially useful local search engines to invoke. To enable accurate selection, metadata that reflect the contents of each search engine must be collected and used. This article proposes a highly scalable and accurate database selection method with several novel features. First, the metadata representing the contents of all search engines are organized into a single integrated representative, which yields both computational and storage efficiency. Second, the selection method is based on a theory for ranking search engines optimally. Experimental results indicate that the method is very effective. An operational prototype system has been built based on the proposed approach.
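
The idea of a single integrated representative can be illustrated with a minimal sketch. It is not the method of the article, whose details are not given in this abstract. It assumes that each search engine is summarized by one weight per term, that the integrated structure keeps only the `r` largest weights per term across all engines, and that engines are scored by summing the retained weights of the query terms. The function names, the parameters `r` and `k`, and the sample data are all hypothetical.

```python
from collections import defaultdict
import heapq


def build_integrated_representative(engine_term_weights, r=2):
    """Merge per-engine term weights into one term-indexed structure.

    engine_term_weights: {engine_id: {term: weight}} (assumed input shape).
    Returns {term: [(weight, engine_id), ...]} holding at most r entries
    per term, largest weight first.
    """
    per_term = defaultdict(list)
    for engine, weights in engine_term_weights.items():
        for term, weight in weights.items():
            per_term[term].append((weight, engine))
    # Keeping only the top r entries per term bounds storage by r * |vocabulary|,
    # independent of the number of engines.
    return {term: heapq.nlargest(r, entries) for term, entries in per_term.items()}


def select_engines(representative, query_terms, k=2):
    """Return up to k engine ids, ranked by summed retained weights (assumed score)."""
    scores = defaultdict(float)
    for term in query_terms:
        for weight, engine in representative.get(term, []):
            scores[engine] += weight
    # Sort by descending score; break ties by engine id for determinism.
    return sorted(scores, key=lambda e: (-scores[e], e))[:k]


# Hypothetical sample data: three engines with illustrative term weights.
engines = {
    "A": {"metasearch": 0.9, "ranking": 0.2},
    "B": {"metasearch": 0.4, "ranking": 0.8},
    "C": {"metasearch": 0.1, "sports": 0.9},
}
rep = build_integrated_representative(engines, r=2)
top = select_engines(rep, ["metasearch", "ranking"], k=1)
```

In this sketch, selection touches only the entries for the query terms, so the cost per query depends on the query length and `r` rather than on the number of engines. Engine `C` is dropped from the `metasearch` entry because its weight is not among the two largest.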
