Segmentation, structure detection and summarization of multimedia sequences

This thesis investigates the problem of efficiently summarizing audio-visual sequences. The problem is important since consumers now have access to vast amounts of multimedia content, that can be viewed over a range of devices. The goal of this thesis is to be able to provide an adaptive framework for automatically generating a short multimedia clip as a summary, when given longer multimedia segment as input to the system. In our framework, the solution to the summarization problem is predicated on the solution to three important sub-problems—segmentation, structure detection and audio-visual condensation of the data. In the segmentation problem, we focus on the determination of computable scenes. These are segments of audio-visual data that are consistent with respect to certain low-level properties and which preserve the syntax of the original video. This work does not address the problem of semantics of the segments, since this is not a well posed problem. There are three novel ideas in our approach: (a) analysis of the effects of rules of production on the data; (b) a finite, causal memory model for segmenting audio and video and (c) the use of top-down structural grouping rules that enable us to be consistent with human perception. These scenes form the input to our condensation algorithm. In the problem of detecting structure, we propose a novel framework that analyzes the topology of the sequence. In our work, we will limit our scope to discrete, temporal structures that have a priori known deterministic generative mechanisms. We show two general approaches to solving the problem, and we shall present robust algorithms for detecting two specific visual structures—the dialog and the regular anchor. We propose a novel entity-utility framework for the problem of condensing audio-visual segments. The idea is that the multimedia sequence can be thought of as comprising entities, a subset of which will satisfy the users information needs. We associate a utility to these entities, and formulate the problem of preserving the entities required by the user as a convex utility maximization problem with constraints. The framework allows for adaptability to changing device and other resource conditions. Other original contributions include—(a) the idea that comprehension of a shot is related to its visual complexity; (b) the idea that the preservation of visual syntax is necessary for the generation of coherent multimedia summaries; (c) auditory analysis that uses discourse structure and (d) novel multimedia synchronization requirements. We conducted user studies using the multimedia summary clips generated by the system. These user studies indicate that the summaries are perceived as coherent at condensation rates as high as 90%. The study also revealed that the measurable improvements over competing algorithms were statistically significant.

[1]  Barry Arons,et al.  Pitch-based emphasis detection for segmenting speech recordings , 1994, ICSLP.

[2]  Stephen W. Smoliar,et al.  Video parsing, retrieval and browsing: an integrated and content-based solution , 1997, MULTIMEDIA '95.

[3]  Q. Summerfield Book Review: Auditory Scene Analysis: The Perceptual Organization of Sound , 1992 .

[4]  Wolfgang Effelsberg,et al.  Automatic audio content analysis , 1997, MULTIMEDIA '96.

[5]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[6]  Svetha Venkatesh,et al.  Automated film rhythm extraction for scene analysis , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[7]  Shih-Fu Chang,et al.  Constrained utility maximization for generating visual skims , 2001, Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL 2001).

[8]  N. Meyers,et al.  H = W. , 1964, Proceedings of the National Academy of Sciences of the United States of America.

[9]  J. Rissanen,et al.  Modeling By Shortest Data Description* , 1978, Autom..

[10]  Martin Cooke,et al.  Modelling auditory processing and organisation , 1993, Distinguished dissertations in computer science.

[11]  Boon-Lock Yeo,et al.  Analysis And Presentation Of Soccer Highlights From Digital Video , 1995 .

[12]  Shih-Fu Chang,et al.  Structure analysis of sports video using domain models , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[13]  Giridharan Iyengar,et al.  Content-based browsing and editing of unstructured video , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[14]  D. O'Shaughnessy,et al.  Recognition of hesitations in spontaneous speech , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[15]  Boon-Lock Yeo,et al.  Time-constrained clustering for segmentation of video into story units , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[16]  Dragutin Petkovic,et al.  Towards robust features for classifying audio in the CueVideo system , 1999, MULTIMEDIA '99.

[17]  Dragutin Petkovic,et al.  Key to effective video retrieval: effective cataloging and browsing , 1998, MULTIMEDIA '98.

[18]  Boon-Lock Yeo,et al.  Video content characterization and compaction for digital library applications , 1997, Electronic Imaging.

[19]  Boon-Lock Yeo,et al.  Rapid scene analysis on compressed video , 1995, IEEE Trans. Circuits Syst. Video Technol..

[20]  Jeho Nam,et al.  Combined audio and visual streams analysis for video sequence segmentation , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[21]  Kevin M. Brooks Do story agents use rocking chairs? The theory and implementation of one model for computational narrative , 1997, MULTIMEDIA '96.

[22]  D. Ruppert Robust Statistics: The Approach Based on Influence Functions , 1987 .

[23]  HongJiang Zhang,et al.  Automatic parsing of TV soccer programs , 1995, Proceedings of the International Conference on Multimedia Computing and Systems.

[24]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[25]  Lisa J. Stifelman A Discourse Analysis Approach to Structured Speech , 1995 .

[26]  Anoop Gupta,et al.  Comparing presentation summaries: slides vs. reading vs. listening , 2000, CHI.

[27]  Tanveer F. Syeda-Mahmood,et al.  Learning video browsing behavior and its application in the generation of video previews , 2001, MULTIMEDIA '01.

[28]  Julia Hirschberg,et al.  Some intonational characteristics of discourse structure , 1992, ICSLP.

[29]  Zhu Liu,et al.  Integration of audio and visual information for content-based video segmentation , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[30]  Wolfgang Effelsberg,et al.  Automatic Movie Abstracting , 1997 .

[31]  Boon-Lock Yeo,et al.  Classification, simplification, and dynamic visualization of scene transition graphs for video browsing , 1997, Electronic Imaging.

[32]  Bin Yu,et al.  Model Selection and the Principle of Minimum Description Length , 2001 .

[33]  John Zimmerman,et al.  A probabilistic layered framework for integrating multimedia content and context information , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[34]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[35]  Stéphane Mallat,et al.  Matching pursuits with time-frequency dictionaries , 1993, IEEE Trans. Signal Process..

[36]  Wolfgang Effelsberg,et al.  Abstracting Digital Movies Automatically , 1996, J. Vis. Commun. Image Represent..

[37]  Barry Arons,et al.  Interactively skimming recorded speech , 1994 .

[38]  Shih-Fu Chang Optimal Video Adaptation and Skimming Using a Utility-Based Framework , 2002 .

[39]  Boon-Lock Yeo,et al.  Segmentation of Video by Clustering and Graph Analysis , 1998, Comput. Vis. Image Underst..

[40]  Malcolm Slaney,et al.  Construction and evaluation of a robust multifeature speech/music discriminator , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[41]  Shih-Fu Chang,et al.  Determining computable scenes in films and their structures using audio-visual memory models , 2000, ACM Multimedia.

[42]  Akio Nagasaka,et al.  Automatic structure visualization for video editing , 1993, INTERCHI.

[43]  John R. Kender,et al.  Video scene segmentation via continuous video coherence , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[44]  Werner A. Stahel,et al.  Robust Statistics: The Approach Based on Influence Functions , 1987 .

[45]  Jorma Rissanen,et al.  The Minimum Description Length Principle in Coding and Modeling , 1998, IEEE Trans. Inf. Theory.

[46]  Jacob Feldman,et al.  Minimization of Boolean complexity in human concept learning , 2000, Nature.

[47]  Alan Hanjalic,et al.  Automated high-level movie segmentation for advanced video-retrieval systems , 1999, IEEE Trans. Circuits Syst. Video Technol..

[48]  Zhu Liu,et al.  Joint video scene segmentation and classification based on hidden Markov model , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[49]  Victor Lesser,et al.  Blackboard systems for knowledge-based signal understanding , 1992 .

[50]  Nevenka Dimitrova,et al.  Color superhistograms for video representation , 1999, Proceedings 1999 International Conference on Image Processing (Cat. 99CH36348).

[51]  Shih-Fu Chang,et al.  Condensing computable scenes using visual complexity and film syntax analysis , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[52]  Ramez Elmasri,et al.  Fundamentals of database systems (2nd ed.) , 1994 .

[53]  Ishwar K. Sethi,et al.  Classification of general audio data for content-based retrieval , 2001, Pattern Recognit. Lett..

[54]  Shih-Fu Chang,et al.  Algorithms and system for segmentation and structure analysis in soccer video , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[55]  William H. Press,et al.  Numerical Recipes in C, 2nd Edition , 1992 .

[56]  Shahram Ebadollahi,et al.  Echocardiogram videos: summarization, temporal segmentation and browsing , 2002, Proceedings. International Conference on Image Processing.

[57]  Julia Hirschberg,et al.  A Prosodic Analysis of Discourse Segments in Direction-Giving Monologues , 1996, ACL.

[58]  Shingo Uchihashi,et al.  Video Manga: generating semantically meaningful video summaries , 1999, MULTIMEDIA '99.

[59]  Barry Arons,et al.  SpeechSkimmer: interactively skimming recorded speech , 1993, UIST '93.

[60]  Shih-Fu Chang,et al.  Computable scenes and structures in films , 2002, IEEE Trans. Multim..

[61]  K. Ramchandran,et al.  A factor graph framework for semantic indexing and retrieval in video , 2000, 2000 Proceedings Workshop on Content-based Access of Image and Video Libraries.

[62]  Teresa H. Meng,et al.  A perceptually based audio signal model with application to scalable audio compression , 1999 .

[63]  Anoop Gupta,et al.  Automatically extracting highlights for TV Baseball programs , 2000, ACM Multimedia.

[64]  Michael G. Christel,et al.  Evolving video skims into useful multimedia abstractions , 1998, CHI.

[65]  Anoop Gupta,et al.  User Benefits of Non-Linear Time Compression , 2000 .

[66]  R. Patterson,et al.  Complex Sounds and Auditory Images , 1992 .

[67]  Shih-Fu Chang,et al.  Structure analysis of soccer video with hidden Markov models , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[68]  C.-C. Jay Kuo,et al.  Heuristic approach for generic audio data segmentation and annotation , 1999, MULTIMEDIA '99.

[69]  David S. Doermann,et al.  Video summarization by curve simplification , 1998, MULTIMEDIA '98.

[70]  Anoop Gupta,et al.  Auto-summarization of audio-video presentations , 1999, MULTIMEDIA '99.

[71]  Daniel Patrick Whittlesey Ellis,et al.  Prediction-driven computational auditory scene analysis , 1996 .

[72]  Brendan J. Frey,et al.  Probabilistic multimedia objects (multijects): a novel approach to video indexing and retrieval in multimedia systems , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[73]  Richard J. Qian,et al.  Detecting semantic events in soccer games: towards a complete solution , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[74]  Lalitha Agnihotri,et al.  Summarization of video programs based on closed captions , 2000, IS&T/SPIE Electronic Imaging.

[75]  Riccardo Leonardi,et al.  Identification of story units in audio-visual sequences by joint audio and video processing , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[76]  Ming Li,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 2019, Texts in Computer Science.

[77]  Yukinobu Taniguchi,et al.  PanoramaExcerpts: extracting and packing panoramas for video browsing , 1997, MULTIMEDIA '97.

[78]  Albert S. Bregman,et al.  Auditory scene analysis : hearing in complex environments , 1993 .

[79]  Guy J. Brown Computational auditory scene analysis : a representational approach , 1993 .

[80]  Julia Hirschberg,et al.  Empirical Studies on the Disambiguation of Cue Phrases , 1993, Comput. Linguistics.

[81]  Shih-Fu Chang,et al.  Concepts and Techniques for Indexing Visual Semantics , 2002 .

[82]  William H. Press,et al.  Numerical recipes in C (2nd ed.): the art of scientific computing , 1992 .

[83]  John Zimmerman,et al.  Integrated multimedia processing for topic segmentation and classification , 2001, Proceedings 2001 International Conference on Image Processing (Cat. No.01CH37205).

[84]  Paul M. B. Vitányi,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 1993, Graduate Texts in Computer Science.

[85]  Robert Tansley The multimedia thesaurus : adding a semantic layer to multimedia information , 2000 .

[86]  Shih-Fu Chang,et al.  Audio scene segmentation using multiple features, models and time scales , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[87]  Shih-Fu Chang,et al.  IMKA: a multimedia organization system combining perceptual and semantic knowledge , 2001, MULTIMEDIA '01.

[88]  Lie Lu,et al.  A robust audio classification and segmentation method , 2001, MULTIMEDIA '01.

[89]  Shih-Fu Chang,et al.  A knowledge engineering approach for image classification based on probabilistic reasoning systems , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[90]  David C. Gibbon,et al.  Automated authoring of hypermedia documents of video programs , 1995, MULTIMEDIA '95.

[91]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[92]  David Bordwell,et al.  Film Art: An Introduction , 1979 .

[93]  John Saunders,et al.  Real-time discrimination of broadcast speech/music , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.