Multigranular Event Recognition of Personal Photo Albums

People are taking more photos than ever before. To organize personal photos effectively, people usually group them into albums according to the events they depict, so recognizing the event of an album automatically would make photo management much more efficient. In this paper, we study the problem of recognizing events in personal photo albums. This is a new challenge because album content is more complicated than in traditional single-photo tasks: not all photos in an album are relevant to the event, and a single photo often fails to convey the event semantics behind the album. To address this problem, we introduce an attention network to learn the representations of photo albums. We then adopt a hierarchical model that recognizes events from coarse to fine using multigranular features. We evaluate our model on two real-world datasets of personal albums and find that it achieves promising results.
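The two ideas named in the abstract, attention over the photos of an album and coarse-to-fine recognition, can be illustrated with a minimal sketch. It is not the model from the paper: the dot-product scoring, the fixed `query` vector, the prototype-based classifier, and the event names are all assumptions chosen for illustration, and a trained network would learn these quantities from data.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(photo_features, query):
    """Collapse per-photo feature vectors into one album vector.

    Each photo is scored against a (hypothetical) query vector; the
    softmax of the scores gives the attention weights, so photos that
    look irrelevant to the event contribute less to the album vector.
    """
    weights = softmax([dot(feat, query) for feat in photo_features])
    dim = len(query)
    album = [
        sum(w * feat[d] for w, feat in zip(weights, photo_features))
        for d in range(dim)
    ]
    return album, weights


def coarse_to_fine(album, coarse_protos, fine_protos):
    """Pick the best coarse class, then the best fine class inside it.

    Prototype vectors stand in for learned classifiers (an assumption).
    """
    coarse = max(coarse_protos, key=lambda c: dot(album, coarse_protos[c]))
    candidates = fine_protos[coarse]
    fine = max(candidates, key=lambda f: dot(album, candidates[f]))
    return coarse, fine


# Toy album: three photos with 2-D features (made-up values).
photos = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [2.0, 0.0]
album_vec, attn = attention_pool(photos, query)

# Hypothetical two-level event hierarchy.
coarse_protos = {"celebration": [1.0, 0.0], "outdoor": [0.0, 1.0]}
fine_protos = {
    "celebration": {"birthday": [1.0, 0.2], "wedding": [0.2, 1.0]},
    "outdoor": {"hiking": [0.1, 1.0], "beach": [0.3, 0.8]},
}
coarse_label, fine_label = coarse_to_fine(album_vec, coarse_protos, fine_protos)
```

In this toy album the second photo is nearly orthogonal to the query, so it receives the smallest attention weight and has little influence on the album vector that the coarse and fine stages then classify.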
