Online Data Organizer: Micro-Video Categorization by Structure-Guided Multimodal Dictionary Learning

Micro-videos have rapidly become one of the dominant trends in the era of social media, which raises the question of how to organize them. Unlike traditional long videos, which may span multiple scenes and can tolerate delay, a micro-video 1) usually records content at one specific venue within a few seconds, and 2) circulates over social networks in a timely manner. Venues are structured hierarchically according to their category granularity, which motivates us to organize micro-videos by their venue structure, while the timeliness of micro-videos calls for effective online processing. However, only 1.22% of micro-videos are labeled with venue information when uploaded from mobile devices. To address this problem, we present a framework to organize micro-videos online. In particular, we first build a structure-guided multi-modal dictionary learning model that learns a concept-level micro-video representation by jointly considering venue structure and modality relatedness. We then develop an online learning algorithm that incrementally and efficiently strengthens the model and categorizes micro-videos into a tree structure. Extensive experiments on a real-world data set validate our model. In addition, we have released the code to facilitate research in the community.
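
To make the idea of online dictionary learning concrete, the following is a minimal sketch of a generic single-modality online update: each incoming sample is sparse-coded with ISTA, and the dictionary is then refreshed from running statistics by block coordinate descent. It is not the structure-guided multi-modal model summarized above: the venue-structure and modality-relatedness terms are omitted, and the function names, the parameter `lam`, and the random initialization are assumptions made only for illustration.

```python
import numpy as np


def soft_threshold(x, t):
    """Element-wise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def sparse_code(D, x, lam, n_iter=100):
    """ISTA for min_a 0.5 * ||x - D a||^2 + lam * ||a||_1 (illustrative solver)."""
    # Step size from the Lipschitz constant of the smooth term's gradient.
    L = np.linalg.norm(D, 2) ** 2 + 1e-12
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return a


def online_dictionary_learning(stream, n_atoms, lam=0.1, seed=0):
    """Process samples one at a time and return the learned dictionary.

    `stream` is any iterable of 1-D feature vectors of equal length
    (a hypothetical stand-in for per-video features of one modality).
    """
    rng = np.random.default_rng(seed)
    D = A = B = None
    for x in stream:
        if D is None:
            # Assumed initialization: random unit-norm atoms.
            D = rng.standard_normal((x.shape[0], n_atoms))
            D /= np.linalg.norm(D, axis=0, keepdims=True)
            A = np.zeros((n_atoms, n_atoms))     # running sum of a a^T
            B = np.zeros((x.shape[0], n_atoms))  # running sum of x a^T
        a = sparse_code(D, x, lam)
        A += np.outer(a, a)
        B += np.outer(x, a)
        # Block coordinate descent over atoms, each kept inside the unit ball.
        for j in range(n_atoms):
            if A[j, j] < 1e-10:
                continue  # atom not used so far; leave it unchanged
            u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D
```

Because only the accumulated matrices `A` and `B` are stored, past samples need not be revisited, which is what makes this kind of update suitable for streaming data. A multi-modal variant would maintain one dictionary per modality and couple their codes, which this sketch does not attempt.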
