Using Evolutive Summary Counters for Efficient Cooperative Caching in Search Engines

We propose and analyze a distributed cooperative caching strategy based on Evolutive Summary Counters (ESC), a new data structure that stores an approximate record of the data accesses in each computing node of a search engine. The ESC capture both the frequency of accesses to the elements of a data collection and the evolution of the access patterns for each node in a network of computers. The ESC can be efficiently condensed into what we call ESC-summaries, which provide approximate statistics of the document entries accessed by each computing node. We use the ESC-summaries to introduce two algorithms that manage our distributed caching strategy: ESC-placement, for the distribution of the cache contents, and ESC-search, for the search of documents in the distributed cache. While the former improves the hit rate of the system and keeps a large ratio of data accesses local, the latter reduces the network traffic by restricting the number of nodes queried to find a document. We show that our cooperative caching approach outperforms state-of-the-art models in hit rate, throughput, and location recall across multiple scenarios, i.e., different query distributions and systems with varying degrees of complexity.
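The abstract does not specify how the ESC are laid out internally, so the following is only an illustrative sketch of the general idea, not the authors' design. It assumes a counting-filter-style array of hashed counters that is periodically halved to track evolving access patterns, a one-bit-per-counter digest standing in for an ESC-summary, and a lookup that restricts which nodes are queried. All names (`DecayingCounterSketch`, `summary`, `candidate_nodes`) and parameters are hypothetical.

```python
import hashlib


def positions(key, num_counters, num_hashes):
    """Map a key to num_hashes counter indexes using salted SHA-256 digests."""
    result = []
    for salt in range(num_hashes):
        digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
        result.append(int.from_bytes(digest[:8], "big") % num_counters)
    return result


class DecayingCounterSketch:
    """Hypothetical stand-in for an ESC: approximate access counts with aging."""

    def __init__(self, num_counters=1024, num_hashes=3):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def record_access(self, key):
        # Increment every counter the key hashes to.
        for pos in positions(key, self.num_counters, self.num_hashes):
            self.counters[pos] += 1

    def estimate(self, key):
        # Collisions can only inflate counters, so the minimum is the tightest estimate.
        return min(self.counters[pos]
                   for pos in positions(key, self.num_counters, self.num_hashes))

    def age(self):
        # Assumed aging policy: halve all counters so old accesses lose weight.
        self.counters = [c // 2 for c in self.counters]

    def summary(self, threshold=1):
        # One bit per counter: a compact digest that could be shared with peers.
        return [c >= threshold for c in self.counters]


def candidate_nodes(key, summaries, num_hashes=3):
    """Return the ids of nodes whose summary suggests they may hold the key."""
    hits = []
    for node_id, bits in summaries.items():
        if all(bits[pos] for pos in positions(key, len(bits), num_hashes)):
            hits.append(node_id)
    return hits
```

In this sketch, a node would call `record_access` on each local access, call `age` at fixed intervals, and publish `summary()` to its peers; a peer would then call `candidate_nodes` before sending a remote lookup, so that only nodes whose digest matches are contacted. As with any hashed summary, false positives are possible, so a matching node is a candidate rather than a guaranteed holder.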
