OverCite: A Distributed, Cooperative CiteSeer

CiteSeer is a popular online resource for the computer science research community, allowing users to search and browse a large archive of research papers. CiteSeer is expensive to run: it generates 35 GB of network traffic per day, requires nearly one terabyte of disk storage, and needs significant human maintenance. OverCite is a new digital research library system that aggregates donated resources at multiple sites to provide CiteSeer-like document search and retrieval, enabling members of the community to share the costs of running CiteSeer. The challenge facing OverCite is providing scalable, load-balanced storage and query processing with automatic data management. OverCite uses a three-tier design: presentation servers provide a user interface identical to CiteSeer's; application servers partition and replicate a search index to spread the work of answering each query among several nodes; and a distributed hash table stores documents and metadata and coordinates the activities of the servers. Evaluation of a prototype shows that a nine-fold increase in the number of servers increases OverCite's query throughput by a factor of seven. OverCite requires more total storage and network bandwidth than centralized CiteSeer, but spreads these costs over all participating sites. OverCite can exploit the resources of these sites to support new features, such as document alerts, and to scale to larger data sets.
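The partition-and-replicate idea behind the application tier can be illustrated with a small sketch. The code below is not OverCite's implementation: the class names (IndexNode, PartitionedIndex), the SHA-1 based partition assignment, and the AND-only keyword matching are all assumptions chosen for brevity. It shows only the general pattern of assigning each document to one index partition, copying that partition to several replicas, and answering a query by asking one replica per partition and merging the results.

```python
import hashlib
import random
from collections import defaultdict


def partition_of(doc_id: str, num_partitions: int) -> int:
    """Map a document ID to a partition (hypothetical hashing scheme)."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class IndexNode:
    """One server holding an inverted index for a single partition."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of document IDs

    def add(self, doc_id: str, text: str) -> None:
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, query: str) -> set:
        """Return the documents on this node containing every query word."""
        words = query.lower().split()
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


class PartitionedIndex:
    """Index split into partitions, each copied to several replica nodes."""

    def __init__(self, num_partitions: int, replicas_per_partition: int):
        self.num_partitions = num_partitions
        self.nodes = [
            [IndexNode() for _ in range(replicas_per_partition)]
            for _ in range(num_partitions)
        ]

    def add(self, doc_id: str, text: str) -> None:
        # A document is indexed in exactly one partition, on all its replicas.
        p = partition_of(doc_id, self.num_partitions)
        for node in self.nodes[p]:
            node.add(doc_id, text)

    def search(self, query: str) -> set:
        # Contact one randomly chosen replica of every partition and merge
        # the partial results; random choice spreads load across replicas.
        hits = set()
        for replicas in self.nodes:
            hits |= random.choice(replicas).search(query)
        return hits
```

In this sketch each query touches one node per partition, so adding partitions shrinks the per-node work for a query, while adding replicas lets more queries run concurrently. How a real deployment balances the two is a design decision this example does not address.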
