Temporal localization of audio events for conflict monitoring in social media

With the explosion in the availability of user-generated videos documenting conflicts and human rights abuses around the world, analysts and researchers increasingly find themselves overwhelmed by the massive amounts of video data they must acquire and analyze to extract useful information. In this paper, we develop a temporal localization framework for intense audio events in videos that addresses this problem. The proposed method uses Localized Self-Paced Reranking (LSPaR) to refine the localization results. LSPaR uses samples progressively, from easy ones to noisier ones, so that it can overcome the noisiness of the initial retrieval results from user-generated videos. We show our framework's efficacy in localizing intense audio events such as gunshots, and further experiments indicate that our method can be generalized to localizing other audio events in noisy videos.
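
The easy-to-noisier idea can be illustrated with a generic self-paced selection loop. The sketch below is not the paper's LSPaR implementation; it is a minimal toy under stated assumptions: each video segment has a single scalar feature, the initial retrieval supplies noisy pseudo-labels (1 for "event", 0 for "background"), and the model is just one centroid per class. The function name `self_paced_rerank` and the parameters `lam` (the "age" threshold) and `growth` are hypothetical names chosen for illustration.

```python
def self_paced_rerank(features, pseudo_labels, lam=0.05, growth=2.0, n_rounds=3):
    """Toy self-paced reranking on 1-D features (illustrative assumption only)."""

    def centroid(label, weights):
        # Weighted mean of the features whose pseudo-label equals `label`.
        num = sum(x * w for x, y, w in zip(features, pseudo_labels, weights) if y == label)
        den = sum(w for y, w in zip(pseudo_labels, weights) if y == label)
        return num / den if den > 0 else None

    # Start from centroids fitted on all (noisy) pseudo-labelled segments.
    weights = [1.0] * len(features)
    cents = {0: centroid(0, weights), 1: centroid(1, weights)}

    for _ in range(n_rounds):
        # Loss of a segment: squared distance to the centroid of its pseudo-label.
        losses = [(x - cents[y]) ** 2 for x, y in zip(features, pseudo_labels)]
        # Keep only "easy" segments, i.e. those whose loss is below lam.
        weights = [1.0 if loss < lam else 0.0 for loss in losses]
        # Refit each centroid on the selected segments (keep the old one if none).
        for label in (0, 1):
            c = centroid(label, weights)
            if c is not None:
                cents[label] = c
        # Raise the threshold so noisier segments are admitted gradually.
        lam *= growth

    # Score: closer to the event centroid than to the background centroid is better.
    scores = [abs(x - cents[0]) - abs(x - cents[1]) for x in features]
    order = sorted(range(len(features)), key=lambda i: -scores[i])
    return order, weights


# Segment 2 is a noisy positive and segment 3 a noisy negative in this toy input.
features = [0.9, 0.85, 0.1, 0.8, 0.15, 0.2, 0.05]
pseudo_labels = [1, 1, 1, 0, 0, 0, 0]
order, weights = self_paced_rerank(features, pseudo_labels)
print(order[:3], weights)
```

In this toy run the two mislabelled segments are never selected for fitting, and the reranked list promotes the mislabelled segment 3 into the top three while demoting segment 2. A real system would replace the centroid model and scalar features with an actual classifier over audio features.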
