Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark

Computing power has now become abundant with multi-core machines, grids and clouds, but it remains a challenge to harness the available power and move towards gracefully handling web-scale datasets. Several researchers have used automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly using small collections on small clusters. In this paper, we describe the engineering process for a prototype of a (near) web-scale multimedia service using the Spark framework running on the AWS cloud service. We present experimental results using up to 43 billion SIFT feature vectors from the public YFCC 100M collection, making this the largest high-dimensional feature vector collection reported in the literature. The design of the prototype and performance results demonstrate both the flexibility and scalability of the Spark framework for implementing multimedia services.
