Highlights Extraction from Unscripted Video

This chapter presents a sports highlights extraction framework built on a hierarchical representation comprising play/break segmentation, audio-visual marker detection, audio-visual marker association, and finer-resolution highlight classification. The framework decomposes the semantic and subjective concept of “sports highlights” into events at different layers. Its key component is the detection of audio and visual objects, which serve as the bridge between the observed video signal and the semantics. This approach departs from the “feature extraction + classification” paradigm for multimedia modeling, particularly where that paradigm relies on global features such as color histograms. Visual object detection also uses image features, but these are local features together with their spatial configuration. The experimental results confirm the advantage of this approach. The chapter reports results of sports highlights extraction based on audio classification and on the correlation between applause/cheering sounds and exciting moments.
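
To make the marker association layer concrete, the following is a minimal sketch, not the chapter's actual method. It assumes that audio and visual markers have already been detected as labeled time intervals, and it pairs a visual marker with an audio marker when the two fall within a small time tolerance of each other. The names `Segment`, `overlaps`, and `associate_markers`, the `goalpost` label, and the 2-second tolerance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """A labeled time interval in seconds (hypothetical representation)."""
    start: float
    end: float
    label: str


def overlaps(a: Segment, b: Segment, tolerance: float = 0.0) -> bool:
    """True if the two intervals overlap or lie within `tolerance` seconds."""
    return a.start <= b.end + tolerance and b.start <= a.end + tolerance


def associate_markers(audio: List[Segment], visual: List[Segment],
                      tolerance: float = 2.0) -> List[Segment]:
    """Pair each visual marker with every nearby audio marker.

    Each pair becomes one highlight candidate spanning both markers.
    The tolerance is an assumed value, not one taken from the chapter.
    """
    candidates = []
    for v in visual:
        for a in audio:
            if overlaps(v, a, tolerance):
                candidates.append(Segment(min(v.start, a.start),
                                          max(v.end, a.end),
                                          f"{v.label}+{a.label}"))
    return candidates


# Illustrative data only: one cheering interval and two visual markers.
audio_markers = [Segment(12.0, 18.0, "cheering")]
visual_markers = [Segment(8.0, 11.0, "goalpost"),
                  Segment(40.0, 45.0, "goalpost")]
candidates = associate_markers(audio_markers, visual_markers)
```

In this sketch only the first visual marker lies close enough to the cheering interval to form a candidate. In a full pipeline, such candidates would then be passed to the finer-resolution highlight classification layer.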