Document assignment in multi-site search engines

Assigning documents accurately to sites is critical for the performance of multi-site Web search engines. In such settings, each site crawls only the documents it indexes and forwards queries to other sites to obtain the best-matching documents stored there. Inaccurate assignments may lead to inefficiencies when crawling Web pages or processing user queries. In this work, we propose a machine-learned document assignment strategy that uses the locality of document views in search results to decide upon assignments. We evaluate the performance of our strategy using various document features extracted from a large Web collection. Our experimental setup uses query logs from several search front-ends in different geographic locations to learn document access patterns. We compare our technique against baselines such as region- and language-based document assignment and observe that it achieves substantial improvements in recall. With our technique, we obtain a low query forwarding rate (0.04) while requiring roughly 45% less document replication than replicating all documents across all sites.
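To make the trade-off between query forwarding and replication concrete, the following minimal sketch may help. It is not the machine-learned strategy proposed here: it swaps in a simple threshold on each site's share of a document's views, and the function names, the threshold, the toy data, and the exact metric definitions (a query counts as forwarded when any of its result documents is not stored at the originating site; the replication saving is measured against storing every document at every site) are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def assign_by_view_share(
    views: Dict[str, Dict[str, int]], threshold: float
) -> Dict[str, Set[str]]:
    """Assign each document to every site whose share of its views is at
    least `threshold` (hypothetical heuristic, not the learned model)."""
    assignment: Dict[str, Set[str]] = {}
    for doc, per_site in views.items():
        total = sum(per_site.values())
        if total == 0:
            assignment[doc] = set()
            continue
        assignment[doc] = {
            site for site, count in per_site.items() if count / total >= threshold
        }
    return assignment


def forwarding_rate(
    queries: List[Tuple[str, List[str]]], assignment: Dict[str, Set[str]]
) -> float:
    """Fraction of queries with at least one result document that is not
    stored at the site where the query was issued (assumed definition)."""
    if not queries:
        return 0.0
    forwarded = sum(
        1
        for site, results in queries
        if any(site not in assignment.get(doc, set()) for doc in results)
    )
    return forwarded / len(queries)


def replication_saving(assignment: Dict[str, Set[str]], n_sites: int) -> float:
    """Fraction of document copies avoided relative to full replication,
    i.e. storing every document at every site (assumed definition)."""
    copies = sum(len(sites) for sites in assignment.values())
    return 1.0 - copies / (len(assignment) * n_sites)


# Toy view counts per document and site, as might be derived from query logs.
views = {
    "d1": {"eu": 5, "us": 4, "asia": 1},
    "d2": {"eu": 3},
    "d3": {"us": 9, "eu": 1},
    "d4": {"asia": 2},
}
assignment = assign_by_view_share(views, threshold=0.3)

# Held-out queries: (originating site, documents in the result list).
queries = [
    ("eu", ["d1", "d2"]),
    ("us", ["d1", "d3"]),
    ("eu", ["d3"]),
    ("asia", ["d4"]),
]
rate = forwarding_rate(queries, assignment)
saving = replication_saving(assignment, n_sites=3)
```

In this toy setting, lowering the threshold replicates documents to more sites, which tends to reduce forwarding at the cost of a smaller replication saving; the learned strategy addresses the same trade-off using document features rather than a fixed threshold.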
