A Content-Adaptive Analysis and Representation Framework for Audio Event Discovery from "Unscripted" Multimedia

We propose a content-adaptive analysis and representation framework that discovers events from audio features in "unscripted" multimedia, such as sports and surveillance content, for the purpose of summarization. The analysis framework performs an inlier/outlier-based temporal segmentation of the content. It is motivated by the observation that "interesting" events in unscripted multimedia occur sparsely against a background of usual or "uninteresting" events. We treat the sequence of low- and mid-level features extracted from the audio as a time series and identify subsequences that are outliers. The outlier detection is based on eigenvector analysis of an affinity matrix constructed from statistical models estimated from subsequences of the time series. We define a confidence measure for each detected outlier as the probability that it is an outlier, and we establish a relationship between the parameters of the framework and this confidence measure. We then use the confidence measure to rank the detected outliers by their departure from the background process. Experimental results with sequences of low- and mid-level audio features extracted from sports video show that "highlight" events can be extracted effectively as outliers from a background process. We also show that the framework brings out suspicious events in surveillance video without any a priori knowledge. This temporal segmentation into background and outliers, together with the ranking by departure from the background, can be used to generate content summaries of any desired length. Finally, we show that the framework can be used to systematically select "key audio classes" that are indicative of events of interest in the chosen domain.
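The inlier/outlier segmentation described in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: each window is modeled by a single diagonal Gaussian, windows are compared with a symmetric KL divergence, the affinity scale defaults to the mean pairwise distance, the bipartition comes from the sign of the second eigenvector of the normalized graph Laplacian, and outliers are ranked by their mean distance to the inlier windows as a simple stand-in for the confidence measure. The function names `window_gaussians`, `symmetric_kl`, and `detect_outliers` are hypothetical.

```python
import numpy as np


def window_gaussians(x, win, hop):
    """Fit a diagonal Gaussian (mean, variance) to each window of x, shape (T, d)."""
    starts = range(0, len(x) - win + 1, hop)
    means = np.array([x[s:s + win].mean(axis=0) for s in starts])
    # Small floor keeps variances strictly positive.
    variances = np.array([x[s:s + win].var(axis=0) + 1e-6 for s in starts])
    return means, variances


def symmetric_kl(means, variances):
    """Pairwise symmetric KL divergence, KL(p||q) + KL(q||p), for diagonal Gaussians."""
    m1, m2 = means[:, None, :], means[None, :, :]
    v1, v2 = variances[:, None, :], variances[None, :, :]
    # The log-determinant terms cancel in the symmetric sum.
    d = 0.5 * ((v1 / v2 + v2 / v1 - 2.0) + (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2))
    return d.sum(axis=-1)


def detect_outliers(x, win=50, hop=50, scale=None):
    """Return (outlier window indices, scores), sorted by decreasing score."""
    means, variances = window_gaussians(x, win, hop)
    dist = symmetric_kl(means, variances)
    if scale is None:
        scale = dist.mean()  # heuristic affinity scale (assumption)
    affinity = np.exp(-dist / (2.0 * scale))

    # Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    lap = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; column 1 is the second smallest.
    _, vecs = np.linalg.eigh(lap)
    second = d_inv_sqrt * vecs[:, 1]

    # Bipartition by sign; treat the smaller cluster as the outliers.
    is_outlier = second > 0
    if is_outlier.sum() > len(is_outlier) / 2:
        is_outlier = ~is_outlier

    outliers = np.flatnonzero(is_outlier)
    inliers = np.flatnonzero(~is_outlier)
    # Rank by mean distance to the background windows (stand-in for the confidence measure).
    scores = dist[np.ix_(outliers, inliers)].mean(axis=1)
    order = np.argsort(-scores)
    return outliers[order], scores[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=(2000, 1))  # synthetic background process
    x[1000:1200] += 4.0                       # synthetic sparse "event"
    ranked, scores = detect_outliers(x)
    print(ranked, scores)
```

A summary of a desired length could then be assembled by taking windows from the top of the ranked list until the length budget is reached. The sign-based split assumes the two clusters stay weakly connected in the affinity matrix; a much smaller `scale` can disconnect them and make the second eigenvector ambiguous.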
