Modal Keywords, Ontologies, and Reasoning for Video Understanding

We propose a novel framework for video content understanding that uses rules constructed from knowledge bases and multimedia ontologies. The framework is an expert system that combines a rule-based engine, domain knowledge, visual detectors (for objects and scenes), and metadata (text from automatic speech recognition, related text, etc.). We introduce the idea of modal keywords: keywords that represent perceptual concepts in one of five categories, namely visual (e.g., sky), aural (e.g., scream), olfactory (e.g., vanilla), tactile (e.g., feather), and taste (e.g., candy). We present a method that automatically classifies keywords from speech recognition, queries, or related text into these categories using WordNet and TGM I. For video understanding, the following operations are performed automatically: scene cut detection, automatic speech recognition, feature extraction, and visual detection (e.g., sky, face, indoor). The rule-based engine then combines the results of these operations with context information (e.g., text from speech) to enhance the visual detection results. We discuss the semi-automatic construction of multimedia ontologies and present experiments in which visual detector outputs are modified by simple rules that use the context information available with the video.
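
As an illustration only, the interplay between modal keywords and context rules described above might be sketched as follows. This is a minimal sketch, not the system's implementation: the toy lexicon `MODAL_LEXICON` stands in for the WordNet and TGM I lookups, and the `Rule` class, the `modal_keywords` and `apply_rules` functions, and the additive score adjustment are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon; the paper derives these categories from
# WordNet and TGM I, which are not queried in this sketch.
MODAL_LEXICON = {
    "sky": "visual",
    "scream": "aural",
    "vanilla": "olfactory",
    "feather": "tactile",
    "candy": "taste",
}


@dataclass(frozen=True)
class Rule:
    """Assumed rule shape: if `keyword` occurs in the context text,
    add `delta` to the confidence of the visual detector `concept`."""
    keyword: str
    concept: str
    delta: float


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return [w.strip(".,!?;:").lower() for w in text.split()]


def modal_keywords(text):
    """Map each known keyword in the text to its perceptual category."""
    return {w: MODAL_LEXICON[w] for w in tokenize(text) if w in MODAL_LEXICON}


def apply_rules(scores, text, rules):
    """Return a copy of the detector scores adjusted by every rule whose
    keyword appears in the context text; scores are clamped to [0, 1]."""
    words = set(tokenize(text))
    adjusted = dict(scores)
    for rule in rules:
        if rule.keyword in words and rule.concept in adjusted:
            new_score = adjusted[rule.concept] + rule.delta
            adjusted[rule.concept] = min(1.0, max(0.0, new_score))
    return adjusted


if __name__ == "__main__":
    transcript = "Look at the sky!"          # e.g., text from speech recognition
    detector_scores = {"sky": 0.5, "indoor": 0.5}
    rules = [Rule("sky", "sky", 0.25), Rule("sky", "indoor", -0.25)]
    print(modal_keywords(transcript))
    print(apply_rules(detector_scores, transcript, rules))
```

The additive adjustment with clamping is just one simple choice; a real rule engine could equally reweight, threshold, or veto detector outputs.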
