Porn Streamer Recognition in Live Video Streaming via Attention-Gated Multimodal Deep Features

Live video streaming platforms have attracted millions of streamers and daily active users. To gain profit and popularity, some streamers mix pornographic content into their live streams in ways intended to evade online supervision, which makes accurate recognition of porn streamers in live video streaming a challenging task. Because porn streamers in live video exhibit multimodal characteristics, including visual and acoustic content, we propose a porn streamer recognition method for live video streaming that uses attention-gated multimodal deep features. Our contributions are as follows: (1) multimodal deep features (spatial, motion, and audio) are extracted from live video streams using convolutional neural networks (CNNs), and the temporal context of the multimodal features is obtained with a bi-directional gated recurrent unit (Bi-GRU); (2) a tri-attention gated mechanism maps the associations between different modalities by assigning higher weights to important features, which further reduces the redundancy of the multimodal features; (3) porn streamers in live video streaming are recognized from the resulting attention-gated multimodal deep features. Six experiments are conducted on a real-world dataset, and the competitive results demonstrate that the method can effectively recognize porn streamers in live video streaming.
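The idea of gating three modality streams with attention can be illustrated with a small, dependency-free Python sketch. It is not the model described above: it has no learned parameters, it assumes the spatial, motion, and audio sequences have already been encoded (for example by CNNs and a Bi-GRU) into vectors of equal dimension, and the function names (`attention_pool`, `tri_attention_gated_fusion`) and the choice of query and gate are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(sequence):
    """Average a sequence of equal-length vectors over time."""
    n = len(sequence)
    return [sum(step[i] for step in sequence) / n
            for i in range(len(sequence[0]))]


def attention_pool(sequence, query):
    """Weight each time step by softmax(step . query), then sum."""
    weights = softmax([dot(step, query) for step in sequence])
    dim = len(sequence[0])
    return [sum(w * step[i] for w, step in zip(weights, sequence))
            for i in range(dim)]


def tri_attention_gated_fusion(spatial, motion, audio):
    """Illustrative (assumed) fusion: each modality is attended using
    the other two as the query, scaled by a scalar sigmoid gate, and
    the three gated vectors are concatenated."""
    streams = {"spatial": spatial, "motion": motion, "audio": audio}
    summaries = {name: mean_vector(seq) for name, seq in streams.items()}
    fused = []
    for name, seq in streams.items():
        # Query: average of the other two modalities' summaries.
        others = [summaries[o] for o in streams if o != name]
        query = [(a + b) / 2.0 for a, b in zip(*others)]
        attended = attention_pool(seq, query)
        # Gate: how strongly this modality agrees with the others.
        gate = sigmoid(dot(attended, query))
        fused.extend(gate * v for v in attended)
    return fused
```

Calling `tri_attention_gated_fusion` with three sequences of `d`-dimensional vectors returns a single vector of length `3 * d`, which a downstream classifier could consume. In a trained model the query and gate would come from learned projections rather than plain averages and dot products.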
