Peer-to-peer information retrieval using shared-content clustering

Peer-to-peer (p2p) networks are used by millions of people to search for and download content. Clustering algorithms have recently been shown to help users find content in large networks. However, many of these algorithms overlook the fact that p2p networks follow graph models with a power-law node degree distribution. This paper studies the clusters obtained when clustering algorithms are applied to power-law graphs, and their applicability to finding content. Motivated by the observed deficiencies, we propose a simple yet efficient clustering algorithm that targets a relaxed optimization of a minimal distance distribution within each cluster, combined with a size-balancing scheme. A comparative analysis using a song-similarity graph collected from 1.2 million Gnutella users reveals that commonly used efficiency measures often overlook search and recommendation applicability issues, giving the false impression that the resulting clusters are well suited to these tasks. We show that the proposed algorithm performs well on several measures that are well suited to the domain.
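To make the idea of distance-based clustering with size balancing concrete, the following is a minimal illustrative sketch and not the algorithm proposed in the paper. It assumes a weighted, connected similarity graph stored as a dictionary of dictionaries, a hand-picked list of seed nodes, and a hypothetical `slack` parameter that scales the per-cluster size cap; the function names are likewise made up for illustration. Each node is greedily assigned to the closest seed (by shortest-path distance) that still has room, which keeps a high-degree hub from absorbing nearly every node, as can happen with plain nearest-seed assignment on power-law graphs.

```python
import heapq
import math


def dijkstra(graph, source):
    """Shortest-path distances from source over non-negative edge weights."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def balanced_clusters(graph, seeds, slack=1.0):
    """Greedy nearest-seed assignment with a per-cluster size cap.

    The cap ceil(slack * n / k) is an assumption of this sketch,
    not a value taken from the paper.
    """
    cap = math.ceil(slack * len(graph) / len(seeds))
    dist = {s: dijkstra(graph, s) for s in seeds}
    # All (distance, node, seed) candidates, closest pairs first.
    candidates = sorted(
        ((d, node, s) for s in seeds for node, d in dist[s].items()),
        key=lambda t: t[0],
    )
    assignment = {}
    sizes = {s: 0 for s in seeds}
    for _, node, s in candidates:
        if node in assignment or sizes[s] >= cap:
            continue  # node already placed, or this cluster is full
        assignment[node] = s
        sizes[s] += 1
    return assignment


if __name__ == "__main__":
    # Toy star graph: hub "h" is adjacent to every other node.
    leaves = ["a", "b", "c", "d", "e"]
    graph = {"h": {leaf: 1.0 for leaf in leaves}}
    for leaf in leaves:
        graph[leaf] = {"h": 1.0}
    print(balanced_clusters(graph, seeds=["h", "a"]))
```

On the toy star graph, unconstrained nearest-seed assignment would place five of the six nodes in the hub's cluster; with the cap, each of the two clusters holds at most three nodes.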
