Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing

In this paper, we study the problem of detecting near duplicates for high dimensional data points in an incremental manner. For example, for an image sharing website, it would be a desirable feature if near-duplicates can be detected whenever a user uploads a new image into the website so that the user can take some action such as stopping the upload or reporting an illegal copy. Specifically, whenever a new point arrives, our goal is to find all points within an existing point set that are close to the new point based on a given distance function and a distance threshold before the new point is inserted into the data set. Based on a well-known indexing technique, Locality Sensitive Hashing, we propose a new approach which clearly speeds up the running time of LSH indexing while using only a small amount of extra space. The idea is to store a small fraction of near duplicate pairs within the existing point set which are found when they are inserted into the data set, and use them to prune LSH candidate sets for the newly arrived point. Extensive experiments based on three real-world data sets show that our method consistently outperforms the original LSH approach: to reach the same query response time, our method needs significantly less memory than the original LSH approach. Meanwhile, the LSH theoretical guarantee on the quality of the search result is preserved by our approach. Furthermore, it is easy to implement our approach based on LSH.

[1]  John Langford,et al.  Cover trees for nearest neighbor , 2006, ICML.

[2]  Robert Krauthgamer,et al.  Navigating nets: simple algorithms for proximity search , 2004, SODA '04.

[3]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[4]  Hanan Samet,et al.  Foundations of multidimensional and metric data structures , 2006, Morgan Kaufmann series in data management systems.

[5]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[6]  Beng Chin Ooi,et al.  Approximate NN queries on Streams with Guaranteed Error/performance Bounds , 2004, VLDB.

[7]  Matthieu Guillaumin,et al.  Combining Image-Level and Segment-Level Models for Automatic Annotation , 2012, MMM.

[8]  Zhengrong Yao,et al.  Evaluating continuous nearest neighbor queries for streaming time series via pre-fetching , 2002, CIKM '02.

[9]  Mayank Bawa,et al.  LSH forest: self-tuning indexes for similarity search , 2005, WWW '05.

[10]  Xiang Lian,et al.  Efficient Similarity Search over Future Stream Time Series , 2008, IEEE Transactions on Knowledge and Data Engineering.

[11]  Justin Zobel,et al.  Discovery of Image Versions in Large Collections , 2007, MMM.

[12]  Beng Chin Ooi,et al.  Indexing the Distance: An Efficient Method to KNN Processing , 2001, VLDB.

[13]  Anthony K. H. Tung,et al.  LDC: enabling search by partial distance in a hyper-dimensional space , 2004, Proceedings. 20th International Conference on Data Engineering.

[14]  R. Fergus,et al.  Tiny images , 2007 .

[15]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[16]  Zhe Wang,et al.  Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.

[17]  Christian Böhm,et al.  Independent quantization: an index compression technique for high-dimensional data spaces , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[18]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[19]  Michael Isard,et al.  General Theory , 1969 .

[20]  Rina Panigrahy,et al.  Entropy based nearest neighbor search in high dimensions , 2005, SODA '06.

[21]  Like Gao,et al.  Continuous similarity-based queries on streaming time series , 2005, IEEE Transactions on Knowledge and Data Engineering.

[22]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[23]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[24]  Rafael C. González,et al.  Digital image processing, 3rd Edition , 2008 .

[25]  Fan Deng,et al.  Approximately detecting duplicates for streaming data using stable bloom filters , 2006, SIGMOD Conference.

[26]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[27]  Yan Ke,et al.  An efficient parts-based near-duplicate and sub-image retrieval system , 2004, MULTIMEDIA '04.

[28]  Marios Hadjieleftheriou,et al.  R-Trees - A Dynamic Index Structure for Spatial Searching , 2008, ACM SIGSPATIAL International Workshop on Advances in Geographic Information Systems.

[29]  Masatoshi Yoshikawa,et al.  The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation , 2000, VLDB.

[30]  Hanan Samet,et al.  Metric space similarity joins , 2008, TODS.