Blackthorn: Large-Scale Interactive Multimodal Learning

This paper presents Blackthorn, an efficient interactive multimodal learning approach that facilitates analysis of multimedia collections of up to 100 million items on a single high-end workstation. Blackthorn features efficient data compression, feature selection, and optimizations to the interactive learning process. The Ratio-64 data representation introduced in this paper costs only tens of bytes per item, yet preserves most of the visual and textual semantic information with good accuracy. The optimized interactive learning model scores the Ratio-64-compressed data directly, greatly reducing the computational requirements. The experiments compare Blackthorn with two baselines: conventional relevance feedback, and relevance feedback using product quantization to compress the features. The results show that Blackthorn is up to 77.5$\times$ faster than conventional relevance feedback while also returning more relevant results: it vastly outperforms that baseline on recall over time and reaches up to 108% of its precision. Compared to the product quantization variant, Blackthorn is just as fast while producing more relevant results. On the full YFCC100M dataset, Blackthorn performs one complete interaction round in roughly 1 s while maintaining adequate relevance of results, thus opening multimedia collections of up to 100 million items to fully interactive, learning-based analysis.
