Distributed Optimization Strategies for Mining on Peer-to-Peer Networks

Peer-to-peer (P2P) networks are distributed systems in which nodes with equal roles and capabilities exchange information and services directly with one another. In recent years, they have become a popular way to share large amounts of data. However, such an architecture adds a new dimension to knowledge discovery and data mining: the challenge of mining distributed (and often dynamic) sources of data and computing. Furthermore, the effective utilization of distributed resources needs to be carefully analyzed. In this paper, we study the problem of optimizing resources to enable efficient and scalable mining on a P2P network. We develop a crawler based on the Gnutella protocol and use it to simulate a P2P network on which we run a classification task. Results from the case study indicate that resources are utilized effectively and that the accuracy of the distributed mining algorithm is likely to be close to that of the hypothetical scenario in which all data in the network is stored in a central location.
