Real-time automated video highlight generation with dual-stream hierarchical growing self-organizing maps

Video has rapidly become one of the most common media for conveying visual information. The videos uploaded to YouTube in a single day are estimated to take over 82 years to watch. Automated tools and techniques for analyzing and understanding video content have therefore become essential. This paper addresses the problem of video highlight generation for large video files. We propose a novel skimming-based unsupervised video highlight generation method that uses statistical image processing and data clustering to process frame-level static and dynamic features of the input video in two streams. The dynamic feature stream is obtained by computing dense optical flow between consecutive frames, which provides instantaneous velocity information for every pixel; this flow is then characterized by a per-frame histogram of quantized orientations, weighted by the flow norm. To process multi-scene videos, we exploit the divisive hierarchical clustering capability of the growing self-organizing map (GSOM) in a two-level top-down approach: the first level clusters the spatial and temporal features of the video, and the second level hierarchically subdivides each parent cluster into child clusters using GSOM. Highlight generation is performed in real time by evaluating video snippets segmented at a pre-defined time interval. We demonstrate the accuracy, robustness, and quality of the generated highlights through a qualitative analysis involving 1625 human experts on highlights generated from two datasets. We also conduct a runtime analysis to demonstrate that the proposed method is efficient enough for use in real-time settings.
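The dynamic feature described in the abstract, a per-frame orientation histogram weighted by the flow norm, can be illustrated with a short sketch. This is not the authors' implementation: the function name `flow_orientation_histogram`, the default of 8 orientation bins, and the final normalization to unit sum are all assumptions made for illustration, since the abstract does not state them. The input is assumed to be a dense flow field of shape (H, W, 2) holding a per-pixel displacement (dx, dy).

```python
import numpy as np


def flow_orientation_histogram(flow, n_bins=8):
    """Magnitude-weighted histogram of optical-flow orientations for one frame.

    flow   : array of shape (H, W, 2) with per-pixel (dx, dy) displacement.
    n_bins : number of orientation bins (assumed value, not from the paper).
    """
    dx = flow[..., 0].ravel()
    dy = flow[..., 1].ravel()

    # Flow norm (speed) of every pixel; used as the histogram weight.
    magnitude = np.hypot(dx, dy)

    # Orientation in [0, 2*pi), then quantized into n_bins equal sectors.
    angle = np.arctan2(dy, dx) % (2 * np.pi)
    bins = np.floor(angle / (2 * np.pi) * n_bins).astype(int) % n_bins

    # Sum the norms that fall into each orientation bin.
    hist = np.bincount(bins, weights=magnitude, minlength=n_bins)

    # Normalizing to unit sum is an assumption; it makes frames comparable.
    total = hist.sum()
    return hist / total if total > 0 else hist


# Two pixels move right with speed 1, two move up with speed 3.
example_flow = np.zeros((2, 2, 2))
example_flow[0, :, 0] = 1.0   # dx = 1, dy = 0
example_flow[1, :, 1] = 3.0   # dx = 0, dy = 3
print(flow_orientation_histogram(example_flow))
```

In practice the flow field could come from any dense optical flow routine that returns an (H, W, 2) array, for example OpenCV's `cv2.calcOpticalFlowFarneback`; the abstract does not say which estimator the authors use. The resulting fixed-length vector per frame is the kind of input a clustering stage such as GSOM could consume.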
