PETA: Photo Albums Event Recognition using Transformers Attention

In recent years, the number of personal photos captured has increased significantly, giving rise to new challenges in multi-image understanding and high-level image understanding. Event recognition in personal photo albums is one such challenging scenario, in which life events are recognized from a disordered collection of images that includes both relevant and irrelevant images. Event recognition in images also requires high-level image understanding, as opposed to low-level image object classification. In the absence of methods for analyzing multiple inputs, previous methods adopted temporal mechanisms, including various forms of recurrent neural networks. However, their effective temporal window is local. In addition, they are not a natural choice given the disordered nature of photo albums. We address this gap with a tailor-made solution that combines the power of CNNs for image representation with transformers for album representation to perform global reasoning over the image collection, offering a practical and efficient solution for photo album event recognition. Our solution reaches state-of-the-art results on 3 prominent benchmarks, achieving above 90% mAP on all datasets. We further explore the related image-importance task in event recognition, demonstrating how the learned attention correlates with human-annotated importance for this subjective task, thus opening the door for new applications.
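The following is a minimal sketch of the idea of global reasoning over an unordered album. It is not the paper's architecture. It assumes that the per-image CNN embeddings are already available (random vectors stand in for them here), uses a single attention head with identity projections and no positional encoding, and mean-pools the result. All function names are hypothetical. It illustrates two properties the abstract relies on: every image attends to every other image, and the album-level vector does not depend on image order.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def self_attention(embeddings):
    """Single-head self-attention with identity projections (an assumption).

    Each image embedding attends to every image in the album, so the
    receptive field is global rather than a local temporal window.
    Returns the attended embeddings and the attention matrix.
    """
    dim = len(embeddings[0])
    scale = math.sqrt(dim)
    weights = [
        softmax([dot(query, key) / scale for key in embeddings])
        for query in embeddings
    ]
    attended = [
        [sum(w * value[d] for w, value in zip(row, embeddings)) for d in range(dim)]
        for row in weights
    ]
    return attended, weights


def album_representation(embeddings):
    """Mean-pool the attended embeddings into one album-level vector."""
    attended, _ = self_attention(embeddings)
    count = len(attended)
    return [sum(vec[d] for vec in attended) / count for d in range(len(attended[0]))]


# Hypothetical stand-ins for CNN image embeddings: 5 images, 4 dimensions each.
rng = random.Random(0)
album = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]

# Present the same images in a different order.
shuffled = album[:]
rng.shuffle(shuffled)

original_vec = album_representation(album)
shuffled_vec = album_representation(shuffled)
```

Because no positional encoding is used, `original_vec` and `shuffled_vec` agree up to floating-point rounding. A recurrent model fed the same two orderings would generally produce different outputs. The rows of the attention matrix returned by `self_attention` are the kind of per-image weights that could be compared against human-annotated importance.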
