Multimodal concept fusion using semantic closeness for image concept disambiguation

In this paper we show how to resolve the ambiguity of concepts extracted from the visual stream with the help of concepts identified in the associated textual stream. The disambiguation is performed at the concept level, based on semantic closeness over a domain ontology. Semantic closeness is a function of the distance in the ontology between the concept to be disambiguated and selected associated concepts. In this process, an image concept can be disambiguated using any associated concept from the image, the text, or both. The ability of text concepts to resolve ambiguity in image concepts varies. Disambiguation works best when the same concept is stated clearly in both the image and the text; the worst case occurs when the image concept is isolated, with no semantically close text concept. The domain ontology used in the disambiguation process is constructed from WordNet and the image labels with their selected senses. The improved accuracy shown in the results demonstrates the effectiveness of the proposed disambiguation process.
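As an illustration of distance-based semantic closeness, the following is a minimal sketch, not the method as implemented in the paper. It assumes a small hand-built ontology (the `EDGES` list is hypothetical and stands in for the WordNet-derived domain ontology), takes closeness to be 1 / (1 + shortest path length), and scores each candidate image concept by its closest associated concept. The abstract does not specify the exact closeness function or how associated concepts are selected, so both choices here are assumptions.

```python
from collections import deque

# Hypothetical toy ontology given as (child, parent) is-a edges.
# It stands in for the WordNet-derived domain ontology described above.
EDGES = [
    ("sea", "body_of_water"),
    ("lake", "body_of_water"),
    ("body_of_water", "natural_object"),
    ("sky", "atmosphere"),
    ("cloud", "atmosphere"),
    ("atmosphere", "natural_object"),
    ("natural_object", "entity"),
    ("boat", "vehicle"),
    ("vehicle", "artifact"),
    ("artifact", "entity"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (child, parent) pairs."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_length(graph, source, target):
    """Return the number of edges on the shortest path, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def closeness(graph, a, b):
    """Assumed closeness measure: 1 / (1 + path length); 0.0 if unconnected."""
    dist = path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def disambiguate(graph, candidates, context):
    """Pick the candidate concept closest to any associated (context) concept.

    An isolated concept (empty context) scores 0.0 for every candidate,
    so the first candidate is returned unchanged.
    """
    def score(candidate):
        return max((closeness(graph, candidate, c) for c in context), default=0.0)

    return max(candidates, key=score)


GRAPH = build_graph(EDGES)

# An ambiguous blue image region with two candidate labels, and two
# concepts extracted from the associated text.
print(disambiguate(GRAPH, ["sky", "sea"], ["lake", "boat"]))
```

In this toy ontology, "lake" is two edges from "sea" and four from "sky", so the text concept pulls the ambiguous region toward "sea". When the same concept appears in both image and text, the path length is zero and the closeness reaches its maximum of 1.0, which corresponds to the best case described above.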
