Distributed Identification of Top-l Inner Product Elements and its Application in a Peer-to-Peer Network

The inner product measures how closely two feature vectors are related. It is an important primitive for many popular data mining tasks, such as clustering, classification, correlation computation, and decision tree construction. If the entire data set is available at a single site, computing the inner product matrix and identifying its top entries (in terms of magnitude) is trivial. In many real-world scenarios, however, the data is distributed across many locations, and transmitting it to a central server would be communication-intensive and would not scale. This paper presents an approximate local algorithm for identifying the top-l inner products among pairs of feature vectors in a large asynchronous distributed environment such as a peer-to-peer (P2P) network. We develop a probabilistic algorithm for this purpose using order statistics and the Hoeffding bound. We present experimental results that show the effectiveness and scalability of the algorithm. Finally, we demonstrate an application of this technique to interest-based community formation in a P2P environment.
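To make the ingredients named above concrete, the following is a minimal sketch, not the paper's algorithm. It shows the centralized baseline that the abstract calls trivial, together with two standard sample-size calculations of the kind a sampling-based method could rely on: one from order statistics and one from the two-sided Hoeffding bound. All function names and parameters (`top_l_inner_products`, `samples_for_top_percentile`, `hoeffding_sample_size`, `epsilon`, `delta`, `value_range`) are assumptions introduced here for illustration.

```python
import math
from itertools import combinations


def top_l_inner_products(features, l):
    """Centralized baseline: return the l feature pairs with the largest
    inner product magnitude as (i, j, inner_product) tuples.

    `features` is a list of equal-length numeric vectors, one per feature.
    """
    scored = []
    for i, j in combinations(range(len(features)), 2):
        ip = sum(a * b for a, b in zip(features[i], features[j]))
        scored.append((abs(ip), i, j, ip))
    # Sort by descending magnitude; break ties by the pair indices.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(i, j, ip) for _, i, j, ip in scored[:l]]


def samples_for_top_percentile(p, q):
    """Order-statistics fact: among n independent uniform draws from a
    population, the largest draw lies above the p-th quantile with
    probability 1 - p**n. Return the smallest n with 1 - p**n >= q.
    """
    return math.ceil(math.log(1.0 - q) / math.log(p))


def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Two-sided Hoeffding bound: for n i.i.d. samples confined to an
    interval of width `value_range`, the sample mean deviates from its
    expectation by at least epsilon with probability at most
    2 * exp(-2 * n * epsilon**2 / value_range**2).
    Return the smallest n that makes this at most delta.
    """
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    feats = [[1, 0, 2], [2, 1, 0], [0, 3, 1], [1, 1, 1]]
    print(top_l_inner_products(feats, 2))
    print(samples_for_top_percentile(0.9, 0.95))
    print(hoeffding_sample_size(0.1, 0.05))
```

The baseline needs every feature vector at one site, which is exactly the cost the distributed setting tries to avoid. The two sample-size helpers indicate why sampling can be attractive: both counts depend only on the desired accuracy and confidence, not on the number of peers or on the total number of feature pairs.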
