Large-scale similarity data management with distributed Metric Index

The metric space is a universal and versatile model of similarity that can be applied in various areas of non-text information retrieval. However, a general, efficient, and scalable solution for metric data management remains an open research challenge. In this work, we take an important step towards a management system able to scale to data collections of billions of objects. We propose a distributed index structure for similarity data management, called the Metric Index (M-Index), which can answer queries in either a precise or an approximate manner. The technique can use any distributed hash table that supports interval queries as its underlying index. We have performed numerous experiments to test various settings of the M-Index structure, and we have demonstrated its usability by developing a full-featured, publicly available Web application.
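
To illustrate why interval queries over an ordered key space can be enough to support metric search, the following is a minimal sketch and not the M-Index algorithm itself. It assumes a simple one-level pivot mapping: each object is keyed by the index of its closest pivot together with its distance to that pivot, and a range query is translated into one key interval per pivot using the triangle inequality. The class name `PivotKeyIndex`, the Euclidean distance, and the in-memory sorted list standing in for a distributed hash table with interval queries are all assumptions made for illustration.

```python
import bisect
import math


def euclid(a, b):
    """Euclidean distance; any metric function could be substituted."""
    return math.dist(a, b)


class PivotKeyIndex:
    """Toy pivot-based mapping of metric objects onto an ordered key space.

    Hypothetical illustration only: a sorted in-memory list plays the role
    of an underlying store that supports interval queries.
    """

    def __init__(self, pivots, dist=euclid):
        self.pivots = list(pivots)
        self.dist = dist
        self.entries = []  # sorted list of (key, object) pairs

    def key(self, obj):
        # Key = (index of closest pivot, distance to that pivot).
        # Tuples sort lexicographically, so each pivot owns a contiguous
        # region of the key space.
        dists = [self.dist(p, obj) for p in self.pivots]
        i = min(range(len(dists)), key=dists.__getitem__)
        return (i, dists[i])

    def insert(self, obj):
        bisect.insort(self.entries, (self.key(obj), obj))

    def interval(self, lo, hi):
        # Interval query: all entries whose key lies in [lo, hi].
        start = bisect.bisect_left(self.entries, (lo,))
        end = start
        while end < len(self.entries) and self.entries[end][0] <= hi:
            end += 1
        return [obj for _, obj in self.entries[start:end]]

    def range_query(self, q, r):
        # If d(q, o) <= r, the triangle inequality gives
        # |d(p, q) - d(p, o)| <= r for every pivot p, so candidates in the
        # region of pivot i must have a key distance in [d(p_i, q) - r,
        # d(p_i, q) + r].
        result = []
        for i, p in enumerate(self.pivots):
            dq = self.dist(p, q)
            lo = (i, max(dq - r, 0.0))
            hi = (i, dq + r)
            for obj in self.interval(lo, hi):
                # The interval only filters candidates; verify each one.
                if self.dist(q, obj) <= r:
                    result.append(obj)
        return result


if __name__ == "__main__":
    index = PivotKeyIndex(pivots=[(0, 0), (5, 5), (0, 5)])
    for x in range(6):
        for y in range(6):
            index.insert((x, y))
    print(sorted(index.range_query((2, 3), 1.5)))
```

The intervals only narrow down the candidates; every retrieved object is still checked against the query with the real distance function, so the result of this sketch is exact. An approximate variant could, for example, visit only a subset of the intervals, although how the M-Index does this is not described here.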
