The use of rare key indexing for distributed web search

In the last few years we have seen a rise in the of peer-to-peer applications in areas like file sharing. However distributed information retrieval applications have not taken off yet. In such an application every peer (website) helps to maintain a global index of all information in the global document collection. When a website is updated the P2P search engine index can also be directly updated by the peer. Each peer only needs to contribute a limited amount of disk space and network bandwidth. Groups of websites can even form their own search engine which specializes in a their area of expertise. In this way the search process can be driven more by the Internet community. In the first part of this thesis the previous work done in the field of distributed information retrieval is discussed. Most of the recently developed systems (like ALVIS, Minerva and pSearch) use a conceptually global but physically distributed index. This index is distributed using a distributed hash table (DHT) based approach. The scalability of such systems is very good and the reported retrieval performance also approaches that of a centralized information retrieval system. However each project uses a different collection and a different set of queries to test their application. Therefore we cannot compare them directly with each other. In the second part I discuss the implementation and evaluation of a distributed information retrieval system based on rare key indexing. Such an index stores sets of terms that appear near each other in a limited number of documents. This approach was first presented as part of the ALVIS project. In this thesis we tested the suitability of the approach for indexing and searching a realistic collection of websites using a subset of the WT10g collection. To measure performance we look at the top-10 overlap between a single term index and a multi term index. In the best case the average overlap ratio was found to be only 7.5%. In the ALVIS project the average overlap ratio was between 83% and 97%. I outline several causes that attribute to this huge difference. Based on the outcome of the experiments I have to conclude that the rare key indexing method scales well. However its retrieval performance on a realistic collection of websites is very poor. Therefore the rare key indexing method cannot be considered a good choice for a distributed web search application.

[1]  Antony I. T. Rowstron,et al.  Squirrel: a decentralized peer-to-peer web cache , 2002, PODC '02.

[2]  Wolf-Tilo Balke,et al.  Progressive distributed top-k retrieval in peer-to-peer networks , 2005, 21st International Conference on Data Engineering (ICDE'05).

[3]  Antony I. T. Rowstron,et al.  Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems , 2001, Middleware.

[4]  Zhichen Xu,et al.  pSearch: information retrieval in structured overlays , 2003, CCRV.

[5]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[6]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM '01.

[7]  Ben Y. Zhao,et al.  An Infrastructure for Fault-tolerant Wide-area Location and Routing , 2001 .

[8]  T. Sherratt,et al.  Hiding in plain sight. , 2004, Trends in ecology & evolution.

[9]  Rajmohan Rajaraman,et al.  Accessing Nearby Copies of Replicated Objects in a Distributed Environment , 1997, SPAA '97.

[10]  Antony I. T. Rowstron,et al.  PAST: a large-scale, persistent peer-to-peer storage utility , 2001, Proceedings Eighth Workshop on Hot Topics in Operating Systems.

[11]  Robert Morris,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM 2001.

[12]  Jie Lu,et al.  Content-based retrieval in hybrid peer-to-peer networks , 2003, CIKM '03.

[13]  Peter Druschel,et al.  Pastry: Scalable, distributed object location and routing for large-scale peer-to- , 2001 .

[14]  Stephen E. Robertson,et al.  Probabilistic models of indexing and searching , 1980, SIGIR '80.

[15]  David R. Karger,et al.  On the Feasibility of Peer-to-Peer Web Indexing and Search , 2003, IPTPS.

[16]  Karl Aberer,et al.  A Query-Adaptive Partial Distributed Hash Table for Peer-to-Peer Systems , 2004, EDBT Workshops.

[17]  Charles Darwin,et al.  Experiments , 1800, The Medical and physical journal.

[18]  Miguel Castro,et al.  SplitStream: high-bandwidth multicast in cooperative environments , 2003, SOSP '03.

[19]  H. S. Heaps,et al.  Information retrieval, computational and theoretical aspects , 1978 .

[20]  David R. Karger,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.

[21]  A Survey of Peer-to-Peer Networks , 2005 .