Prototyping a Web-Scale Multimedia Retrieval Service Using Spark

The world has experienced phenomenal growth in data production and storage in recent years, much of it in the form of media files. At the same time, computing power has become abundant through multi-core machines, grids, and clouds. Yet it remains a challenge to harness this power for gracefully searching and retrieving from web-scale media collections. Several researchers have experimented with automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly with small collections on small computing clusters. In this article, we describe a prototype of a (near) web-scale, throughput-oriented multimedia retrieval service built on the Spark framework and running on the AWS cloud service. We present retrieval results using up to 43 billion SIFT feature vectors from the public YFCC 100M collection, making this the largest high-dimensional feature vector collection reported in the literature. We also present a publicly available demonstration retrieval system, running on our own servers, where the implementation of the Spark pipelines can be observed in practice using standard image benchmarks and downloaded for research purposes. Finally, we describe a method to evaluate the retrieval quality of the prototype's ever-growing high-dimensional index without actually indexing a web-scale media collection.
