Design and evaluation of a music video summarization system

We present a system that summarizes the textual, audio, and video information of music videos in a format tuned to the preferences of a focus group of 20 users. First, we analyzed user needs for the content and layout of the music summaries. Then, we designed algorithms that segment individual song videos from full music video programs by detecting changes in color palette, transcript, and audio classification. We summarize each song with automatically selected high-level information such as title, artist, duration, title frame, and text, as well as audio and visual segments of the chorus. Our system automatically determines chorus locations with high recall and precision from the placement of repeated words and phrases in the text of the song's lyrics. A Bayesian belief network then selects other significant video and audio content from the multiple media. Overall, we are able to compress content by a factor of 10. A second user study identified the principal variations between users in the content they want in the summary and in the platforms they want to view it on.
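
The idea of locating a chorus from repeated lyric text can be illustrated with a minimal sketch. It is not the system's actual algorithm, which the abstract does not specify. It assumes the lyrics are available as a list of transcript lines, and the names `normalize`, `find_repeated_lines`, and `group_runs` are hypothetical helpers introduced only for this example. Lines whose normalized text recurs are flagged, and consecutive flagged lines are merged into candidate chorus blocks.

```python
import re
from collections import Counter


def normalize(line):
    """Lowercase and drop punctuation so near-identical lines compare equal."""
    return re.sub(r"[^a-z0-9 ]+", "", line.lower()).strip()


def find_repeated_lines(lyric_lines, min_repeats=2):
    """Return indices of lines whose normalized text occurs at least min_repeats times."""
    keys = [normalize(line) for line in lyric_lines]
    counts = Counter(key for key in keys if key)
    return [i for i, key in enumerate(keys) if key and counts[key] >= min_repeats]


def group_runs(indices):
    """Merge consecutive indices into inclusive (start, end) runs."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


# Hypothetical transcript lines, invented for illustration only.
lyrics = [
    "I walked down the road",
    "Sing it loud",
    "Sing it all night",
    "The rain came down",
    "Sing it loud!",
    "Sing it all night",
]

candidate_blocks = group_runs(find_repeated_lines(lyrics))
print(candidate_blocks)
```

Exact line matching is a simplification made here. A real transcript would probably need fuzzy phrase matching and caption timestamps to map each candidate block back to its audio and video segments.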
