Scalable and Efficient Spatial-Aware Parallelization Strategies for Multimedia Retrieval

Similarity search is a key operation in several multimedia applications, including online Content-Based Multimedia Retrieval (CBMR) services. These applications must handle very large databases and are subject to high query rates. In this context, scalability on distributed-memory systems is critical to assemble the required computing power and memory space. However, we have identified that the Data Equal Split (DES) parallelization and its associated data partition strategy, employed by related work in the domain, are limited in efficiency and scalability. Therefore, in this paper, we develop and implement a framework for executing similarity search on distributed-memory machines and propose a novel class of data partition strategies that take the spatial organization of the data into account when distributing it. This approach reduces both the communication traffic and the cost of processing each task in the local searches carried out on the distributed machine. Our approach attains a speedup of 2.4× over DES in the baseline case (5 nodes), achieves higher scalability efficiency, and is 14.5× faster when 160 nodes are used. In fact, our novel data organization led to superlinear scalability in all configurations evaluated.
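The contrast between an equal split and a spatial-aware partition can be illustrated with a minimal sketch. It is not the framework described in this paper: the round-robin assignment used for DES, the nearest-centroid assignment used for the spatial-aware case, and all function names and the `probes` parameter are assumptions made only for illustration. The point it shows is that an equal split gives no hint about where a query's neighbors live, so every node must search, whereas a spatial-aware layout lets a query be routed to the few nodes whose regions lie closest to it.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def des_partition(points, num_nodes):
    """Equal split (assumed round-robin): ignores where points lie in space."""
    parts = [[] for _ in range(num_nodes)]
    for i, p in enumerate(points):
        parts[i % num_nodes].append(p)
    return parts


def spatial_partition(points, centroids):
    """Spatial-aware split (assumed scheme): node j holds the points
    whose nearest centroid is centroids[j]."""
    parts = [[] for _ in centroids]
    for p in points:
        owner = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        parts[owner].append(p)
    return parts


def nodes_to_query_des(num_nodes):
    """With an equal split, any node may hold neighbors: query all of them."""
    return list(range(num_nodes))


def nodes_to_query_spatial(query, centroids, probes):
    """With a spatial split, query only the `probes` nodes whose
    centroids are closest to the query (hypothetical parameter)."""
    order = sorted(range(len(centroids)), key=lambda j: sq_dist(query, centroids[j]))
    return order[:probes]


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (0.0, 0.2)]
    centroids = [(0.0, 0.0), (10.0, 10.0)]
    print(des_partition(points, 2))
    print(spatial_partition(points, centroids))
    print(nodes_to_query_des(2))
    print(nodes_to_query_spatial((9.0, 9.0), centroids, probes=1))
```

In this sketch, fewer contacted nodes means fewer messages per query and less local work per task, which is the kind of saving the abstract attributes to spatial-aware partitioning. Probing only a subset of nodes generally makes the search approximate, and a spatial split may produce unbalanced partitions; how the actual strategies handle either issue is not stated here.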
