Beyond audio and video retrieval: towards multimedia summarization

Given the deluge of multimedia content becoming available over the Internet, it is increasingly important to be able to examine and organize these large stores of information effectively, in ways that go beyond browsing or collaborative filtering. In this paper we review previous work on audio and video processing and define the task of Topic-Oriented Multimedia Summarization (TOMS) using natural language generation: given a set of features automatically extracted from a video (such as visual concepts and ASR transcripts), a TOMS system automatically generates a paragraph of natural language (a "recounting") that summarizes the important information in a video belonging to a certain topic area and explains why the video was matched and retrieved. We see this as a first step towards systems that will be able to discriminate between visually similar but semantically different videos, compare two videos and provide textual output, or summarize a large number of videos at once. We then introduce our approach to solving the TOMS problem: we extract visual concept features and ASR transcription features from a given video and develop a template-based natural language generation system that produces a textual recounting from the extracted features. We also propose possible experimental designs for continuously evaluating and improving TOMS systems, and present results of a pilot evaluation of our initial system.
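
The template-based recounting step can be illustrated with a minimal sketch. It is not the system described in the paper: the function names (`join_items`, `recount`), the confidence threshold, the item limit, and the sentence templates are all hypothetical, and the inputs are assumed to be a mapping from visual concept names to detector scores plus a list of keywords taken from an ASR transcript.

```python
from typing import Dict, List


def join_items(items: List[str]) -> str:
    """Join words into an English list: 'a', 'a and b', 'a, b, and c'."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + ", and " + items[-1]


def recount(topic: str,
            concept_scores: Dict[str, float],
            asr_keywords: List[str],
            threshold: float = 0.5,   # hypothetical cut-off, not from the paper
            max_items: int = 3) -> str:
    """Fill fixed sentence templates from visual concepts and ASR keywords."""
    # Keep only concepts at or above the threshold, strongest first.
    confident = sorted(
        (c for c, s in concept_scores.items() if s >= threshold),
        key=lambda c: concept_scores[c],
        reverse=True,
    )[:max_items]
    spoken = asr_keywords[:max_items]

    # Hypothetical templates: one sentence per evidence type.
    sentences = [f"This video was matched to the topic '{topic}'."]
    if confident:
        sentences.append(f"We saw {join_items(confident)}.")
    if spoken:
        sentences.append(f"We heard the words {join_items(spoken)}.")
    if not confident and not spoken:
        sentences.append("No supporting evidence was found.")
    return " ".join(sentences)


if __name__ == "__main__":
    print(recount("birthday party",
                  {"cake": 0.9, "candles": 0.7, "car": 0.2},
                  ["happy", "birthday"]))
```

Keeping the visual and spoken evidence in separate sentences lets the output state which modality supported the match, in line with the abstract's goal of explaining why a video was retrieved.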
