Summarization of videotaped presentations: automatic analysis of motion and gesture

This paper presents an automatic system for analyzing and annotating video sequences of technical talks. Our method uses a robust motion estimation technique to detect key frames and segment the video sequence into subsequences containing a single overhead slide. The subsequences are stabilized to remove motion that occurs when the speaker adjusts their slides. Any changes remaining between frames in the stabilized sequences may be due to speaker gestures such as pointing or writing, and we use active contours to automatically track these potential gestures. Given the constrained domain, we define a simple set of actions that can be recognized based on the active contour shape and motion. The recognized actions provide an annotation of the sequence that can be used to access a condensed version of the talk from a Web page.
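The segmentation step can be illustrated with a deliberately simplified sketch. It assumes that a per-frame global motion magnitude has already been computed by some motion estimator; that input, the fixed threshold, and the function names `segment_by_motion` and `key_frames` are all hypothetical and are not taken from the paper. Frames with large motion are treated as slide changes, the low-motion runs between them as single-slide subsequences, and the most stable frame in each run as its key frame. The robust estimation, stabilization, and active-contour tracking described above are not modeled here.

```python
from typing import List, Tuple


def segment_by_motion(motion: List[float], threshold: float,
                      min_len: int = 1) -> List[Tuple[int, int]]:
    """Split frames into low-motion runs.

    `motion` holds one (assumed, precomputed) global motion magnitude per frame.
    Returns inclusive (start, end) index pairs of runs whose motion stays
    below `threshold` and that are at least `min_len` frames long.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current low-motion run began, if any
    for i, m in enumerate(motion):
        if m < threshold:
            if start is None:
                start = i
        else:
            # A high-motion frame closes the current run (hypothetical slide change).
            if start is not None and i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(motion) - start >= min_len:
        segments.append((start, len(motion) - 1))
    return segments


def key_frames(motion: List[float],
               segments: List[Tuple[int, int]]) -> List[int]:
    """Pick the frame with the smallest motion in each segment (first one on ties)."""
    return [min(range(s, e + 1), key=lambda i: motion[i]) for s, e in segments]


if __name__ == "__main__":
    # Toy motion magnitudes: two bursts of large motion separate three quiet runs.
    motion = [0.1, 0.2, 0.1, 5.0, 6.0, 0.3, 0.05, 0.2, 4.0, 0.1]
    segs = segment_by_motion(motion, threshold=1.0)
    print(segs)
    print(key_frames(motion, segs))
```

A fixed threshold is used only to keep the sketch short; in practice the decision rule would depend on the motion model and the noise in the estimates.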
