A communication efficient probabilistic algorithm for mining frequent itemsets from a peer-to-peer network

Data-intensive, large-scale distributed systems such as peer-to-peer (P2P) networks are becoming increasingly popular in settings where data cannot be centralized for mining and analysis. Unfortunately, most existing data mining algorithms work only when the data can be accessed in its entirety. Finding all network-wide frequent itemsets is computationally difficult and usually incurs a large communication overhead in such environments. This paper focuses on developing a communication-efficient algorithm for discovering frequent itemsets from a P2P network. It adopts a sampling-based approach that finds an approximate solution with a probabilistic guarantee instead of an exact solution. The benefit of the approximation is the low communication overhead needed to discover the majority of frequent itemsets with a probabilistic guarantee. The algorithm rests on the assumption that an independent and identically distributed (iid) sample of the entire data is available at one location, where it is used to generate a set of candidate itemsets. Collecting an iid sample from a P2P network is challenging because peers vary in their degree of connectivity and in the amount of data they share. The paper first addresses this issue and shows how an iid sample of nodes and data can be collected from a P2P network using a random walk. It then applies the proposed sampling technique to identify most of the frequent itemsets in a P2P network. Theoretical analysis shows how to choose the optimum sample size and minimize the communication needed to compute the results. Experimental results show that the proposed algorithm discovers all of the network-wide frequent itemsets using communication that scales sublinearly with network and data size. © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2: 48-69, 2009
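
To illustrate why varying connectivity matters for sampling, the sketch below uses a Metropolis-Hastings random walk, a standard way to correct the degree bias of a plain random walk so that the node where the walk stops is approximately uniformly distributed. This is an illustration under stated assumptions, not necessarily the procedure the paper uses: the abstract does not specify the walk, and the function name `mh_random_walk`, the toy graph `STAR`, and the walk length are all hypothetical. It also samples nodes only; it does not account for differing data sizes per peer.

```python
import random
from collections import Counter

# Hypothetical toy overlay: one hub (node 0) connected to four leaf peers.
# A plain random walk would be at the hub about half the time; the
# correction below aims for an equal share per node.
STAR = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}


def mh_random_walk(adj, start, steps, rng):
    """Return the node reached after `steps` Metropolis-Hastings moves.

    adj   : dict mapping each node to a list of its neighbours
            (assumed undirected and connected)
    start : node where the walk begins
    steps : number of proposed moves (assumed long enough to mix)
    rng   : a random.Random instance
    """
    current = start
    for _ in range(steps):
        # Propose a neighbour uniformly at random.
        candidate = rng.choice(adj[current])
        # Accept with probability min(1, deg(current) / deg(candidate)),
        # which makes the uniform distribution over nodes stationary.
        accept = min(1.0, len(adj[current]) / len(adj[candidate]))
        if rng.random() < accept:
            current = candidate
        # Otherwise the walk stays where it is for this step.
    return current


def empirical_node_frequencies(adj, start, steps, n_samples, seed=0):
    """Run independent walks and return the fraction ending at each node."""
    rng = random.Random(seed)
    ends = Counter(mh_random_walk(adj, start, steps, rng)
                   for _ in range(n_samples))
    return {node: ends[node] / n_samples for node in adj}
```

Calling `empirical_node_frequencies(STAR, 0, 40, 4000)` gives a rough check that the hub is not over-represented. Each walk is run independently from the same start node, so the resulting node samples are independent of each other; how long the walk must be in a real network depends on its mixing time, which this toy example does not address.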
