Auto-summarization of audio-video presentations

As streaming audio-video technology becomes widespread, the amount of multimedia content available on the net is increasing dramatically. Users face a new challenge: how to examine large amounts of multimedia content quickly. One technique that enables a quick overview is the video summary, a shorter version assembled by picking important segments from the original. We evaluate three techniques for automatically creating summaries of online audio-video presentations. These techniques exploit information in the audio signal (e.g., pitch and pause information), knowledge of slide transition points in the presentation, and information about access patterns of previous users. We report a user study that compares automatically generated summaries, 20%-25% the length of the full presentations, with author-generated summaries. Users learn from the computer-generated summaries, although less than from the authors' summaries. They initially find computer-generated summaries less coherent, but quickly grow accustomed to them.
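
To make the idea of assembling a summary from selected segments concrete, here is a minimal sketch of one way slide transition points could drive segment selection. It is an illustration under stated assumptions, not the method evaluated in the study: the abstract does not say how the transition points are used, so the rule below (keep the opening portion of each slide, in proportion to how long the slide was shown) is assumed, and the function name `summarize_by_slides` and its parameters are hypothetical.

```python
from typing import List, Tuple


def summarize_by_slides(
    slide_starts: List[float],
    duration: float,
    target_ratio: float = 0.25,
) -> List[Tuple[float, float]]:
    """Return one (start, end) segment per slide, in seconds.

    Assumed rule (not taken from the paper): each slide contributes its
    opening portion, sized as target_ratio times the time the slide was
    shown. Slide start times are expected to lie within [0, duration];
    any time before the first slide start is not summarized.
    """
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in (0, 1]")

    # Slide i runs from its start to the next slide's start,
    # and the last slide runs to the end of the presentation.
    bounds = sorted(slide_starts) + [duration]

    segments = []
    for start, end in zip(bounds, bounds[1:]):
        shown_for = end - start
        if shown_for <= 0:
            continue  # skip empty or out-of-range slides
        segments.append((start, start + target_ratio * shown_for))
    return segments


if __name__ == "__main__":
    # A 300-second talk with slides starting at 0 s, 60 s, and 180 s.
    for start, end in summarize_by_slides([0.0, 60.0, 180.0], 300.0):
        print(f"{start:.0f}s to {end:.0f}s")
```

With a target ratio of 0.25, the three segments total 75 seconds of a 300-second talk, at the upper end of the 20%-25% range compared in the study. A fuller version might rank segments using audio cues such as pitch and pauses, or the access patterns of previous viewers, rather than always taking the start of each slide.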
