Severe complexity constraints on consumer electronic devices motivate us to investigate general-purpose video summarization techniques that can apply a common hardware setup to multiple content genres. On the other hand, we know that high-quality summaries can only be produced with domain-specific processing. In this paper, we present a video summarization technique based on time-series analysis that provides a general core to which we can add small content-specific extensions for each genre. The proposed time-series analysis technique consists of unsupervised clustering of samples taken through sliding windows from the time series of features obtained from the content. We classify content into two broad categories: scripted content, such as news and drama, and unscripted content, such as sports and surveillance. The summarization problem then reduces to either finding semantic boundaries in the scripted content or detecting highlights in the unscripted content. The proposed technique is essentially an event detection technique and is thus best suited to unscripted content; however, we also find applications to scripted content. We thoroughly examine the trade-off between content-neutral and content-specific processing for effective summarization across a number of genres, and find that our core technique enables us to minimize the complexity of the content-specific processing and to postpone it to the final stage. We achieve the best results with unscripted content such as sports and surveillance video, both in the quality of the summaries and in how little content-specific processing is needed. For other genres, such as drama, we find that more content-specific processing is required. We also find that a judicious choice of key audio-visual object detectors enables us to minimize the complexity of the content-specific processing while maintaining its applicability to a broad range of genres. We will present a demonstration of our proposed technique at the conference.
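To make the core idea concrete, the following is a minimal sketch of unsupervised clustering over sliding windows of a feature time series, with the minority cluster treated as candidate events. It is an illustration under stated assumptions, not the implementation described in this paper: the window descriptor (mean and standard deviation), the use of plain 2-means, the window width and step, and all function names are hypothetical choices made only for this example.

```python
import math
import random


def sliding_windows(series, width, step):
    """Return (start_index, window) pairs covering the series."""
    return [(s, series[s:s + width])
            for s in range(0, len(series) - width + 1, step)]


def describe(window):
    """Summarize a window by its mean and standard deviation (assumed descriptor)."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    return (mean, math.sqrt(var))


def two_means(points, iterations=20):
    """Plain 2-means clustering (assumed clusterer); returns a 0/1 label per point."""
    # Deterministic start: the descriptors with the smallest and largest mean.
    centers = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [min((0, 1), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Move each center to the average of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def detect_events(series, width=20, step=10):
    """Return start indices of windows that fall in the minority cluster."""
    windows = sliding_windows(series, width, step)
    labels = two_means([describe(w) for _, w in windows])
    minority = 0 if labels.count(0) < labels.count(1) else 1
    return [start for (start, _), lab in zip(windows, labels) if lab == minority]


# Synthetic feature track: quiet background with one burst at samples 200-239.
random.seed(0)
series = [random.gauss(0.0, 0.1) for _ in range(400)]
for i in range(200, 240):
    series[i] += 3.0

events = detect_events(series)
print(events)  # start indices of windows flagged as candidate events
```

On this synthetic track, the flagged windows should lie in and around the burst. Treating the smaller cluster as the "event" cluster reflects the assumption that highlights in unscripted content are rare relative to the background; any genre-specific processing would then operate only on the flagged segments.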