Video annotation: the role of specialist text

Digital video is among the most information-intensive modes of communication. The retrieval of video from digital libraries, along with sound and text, is a major challenge for the computing community in general and for the artificial intelligence community in particular. The advent of digital video has set some old questions in a new light. Questions relating to aesthetics and to the role of surrogates (image for reality, text for image) invariably touch upon the link between vision and language, and dealing with this link computationally is important for the artificial intelligence enterprise. Images that are interesting both aesthetically and for research in video retrieval include those that are constrained and patterned yet convey rich meanings; dance is one example. For our purposes these are specialist images, requiring a special language for description and interpretation. They also require specialist knowledge to be understood, since there is usually more than meets the untrained eye; this knowledge, too, may be articulated in the language of the specialism. In order to be retrieved effectively and efficiently, video has to be annotated, particularly so for specialist moving images. Annotation involves attaching keywords from the specialism together with, in our case, commentaries produced by experts, both those written and spoken specifically for annotation and those obtained from a corpus of extant texts. A system that processes such collateral text for video annotation should perhaps be grounded in an understanding of the link between vision and language. This thesis attempts to synthesise ideas from artificial intelligence, multimedia systems, linguistics, cognitive psychology and aesthetics. The link between vision and language is explored by focusing on moving images of dance and the special language used to describe and interpret them. We have developed an object-oriented system, KAB, which helps to annotate a digital video library with a collateral corpus of texts and terminology. User evaluation has been encouraging, and the system is now available on the WWW.
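As a rough illustration of the kind of object-oriented annotation structure described above, the sketch below links a time-bounded video segment to specialist keywords and collateral expert commentary, with simple keyword-based retrieval over both. This is a minimal sketch only; all class and field names are hypothetical and are not taken from KAB.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CollateralText:
    """An expert commentary attached to a video segment (hypothetical structure)."""
    author: str
    source: str   # e.g. "written for annotation" or "extant corpus"
    text: str


@dataclass
class VideoSegment:
    """A time-bounded segment of digital video annotated with specialist
    keywords and collateral texts (hypothetical structure)."""
    video_id: str
    start_seconds: float
    end_seconds: float
    keywords: List[str] = field(default_factory=list)
    commentaries: List[CollateralText] = field(default_factory=list)

    def matches(self, term: str) -> bool:
        """Naive keyword-based check over the keywords and commentary text."""
        term = term.lower()
        return any(term in k.lower() for k in self.keywords) or any(
            term in c.text.lower() for c in self.commentaries
        )


# Example: annotating a dance sequence with ballet terminology and an expert note.
segment = VideoSegment("swan_lake_act2", 120.0, 148.5)
segment.keywords += ["pas de deux", "arabesque"]
segment.commentaries.append(
    CollateralText("dance critic", "written for annotation",
                   "The ballerina holds an extended arabesque before the lift.")
)
print(segment.matches("arabesque"))  # True
```

In practice, a full annotation system would layer terminology control and corpus processing over such segment records; the sketch only shows the basic association of segments, keywords and collateral text.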
