Optimizing Near Duplicate Detection for P2P Networks

In this paper, we propose a probabilistic algorithm for efficiently and effectively detecting near-duplicate text, audio, and video resources in large-scale P2P systems. To this end, we present a thorough cost and probabilistic analysis that allows the algorithm to adapt to network and data collection characteristics so as to minimize network cost. In addition, we extend the algorithm so that it can identify similar videos even when some of the videos are split across different files. A thorough theoretical analysis and a large-scale experimental evaluation on networks of up to 100,000 peers, using real-world datasets of more than 200 GB, demonstrate the viability of our approach.
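To make the idea of probabilistic near-duplicate detection concrete, the following is a minimal sketch of one well-known technique, min-wise hashing over word shingles. It is an illustration only: the abstract does not state which hashing scheme the proposed algorithm uses, and the function names (`shingles`, `minhash_signature`, `estimate_similarity`), the shingle length `k`, and the number of hash functions `num_hashes` are all assumptions introduced here. The sketch also runs on a single machine and does not model the P2P distribution or the cost analysis.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """Keep, for every seeded hash function, the minimum hash over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print(estimate_similarity(sig_a, sig_b))
```

Two resources whose signatures agree in a large fraction of positions are likely near duplicates. Because the signatures are short and have a fixed length, peers could exchange them instead of the resources themselves; how signatures are routed and how many are sent is exactly the kind of parameter a cost analysis would tune.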
