Approximate Object Location and Spam Filtering on Peer-to-Peer Systems

Recent work in P2P overlay networks allow for decentralized object location and routing (DOLR) across networks based on unique IDs. In this paper, we propose an extension to DOLR systems to publish objects using generic feature vectors instead of content-hashed GUIDs, which enables the systems to locate similar objects. We discuss the design of a distributed text similarity engine, named Approximate Text Addressing (ATA), built on top of this extension that locates objects by their text descriptions. We then outline the design and implementation of a motivating application on ATA, a decentralized spam-"ltering service. We evaluate this system with 30,000 real spam email messages and 10,000 non-spam messages, and "nd a spam identi" cation ratio of over 97% with zero false positives.

[1]  Ben Y. Zhao,et al.  An Infrastructure for Fault-tolerant Wide-area Location and Routing , 2001 .

[2]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM 2001.

[3]  Peter Druschel,et al.  Pastry: Scalable, distributed object location and routing for large-scale peer-to- , 2001 .

[4]  David R. Karger,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.

[5]  Udi Manber,et al.  Finding Similar Files in a Large File System , 1994, USENIX Winter.

[6]  Ben Y. Zhao,et al.  Towards a Common API for Structured Peer-to-Peer Overlays , 2003, IPTPS.

[7]  Susan T. Dumais,et al.  A Bayesian Approach to Filtering Junk E-Mail , 1998, AAAI 1998.

[8]  A. Broder Some applications of Rabin’s fingerprinting method , 1993 .

[9]  David Mazières,et al.  Kademlia: A Peer-to-Peer Information System Based on the XOR Metric , 2002, IPTPS.

[10]  Ben Y. Zhao,et al.  Distributed Object Location in a Dynamic Network , 2002, SPAA '02.

[11]  Michael B. Jones,et al.  SkipNet: A Scalable Overlay Network with Practical Locality Properties , 2003, USENIX Symposium on Internet Technologies and Systems.

[12]  Ben Y. Zhao,et al.  Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and , 2001 .

[13]  Antony I. T. Rowstron,et al.  Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems , 2001, Middleware.

[14]  Ian H. Witten,et al.  Managing gigabytes (2nd ed.): compressing and indexing documents and images , 1999 .

[15]  David R. Karger,et al.  On the Feasibility of Peer-to-Peer Web Indexing and Search , 2003, IPTPS.

[16]  William R. Hersh,et al.  Managing Gigabytes—Compressing and Indexing Documents and Images (Second Edition) , 2001, Information Retrieval.

[17]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM '01.

[18]  Ian H. Witten,et al.  Managing Gigabytes: Compressing and Indexing Documents and Images , 1999 .