A System for the Semantic Multimodal Analysis of News Audio-Visual Content

News-related content is among the most popular types of content for users in everyday applications. Inexpensive media capturing devices and media sharing services that target both professional and user-generated news content have made the generation and distribution of such content commonplace. However, the automatic analysis and annotation required to support intelligent search and delivery of this content remain an open issue. This paper presents a complete architecture for the knowledge-assisted multimodal analysis of news-related multimedia content, along with its constituent components. The proposed architecture employs state-of-the-art methods to analyze each modality (visual, audio, text) separately, and introduces a novel fusion technique, based on the particular characteristics of news-related content, for combining the individual modality analysis results. Experimental results on news broadcast video illustrate the usefulness of the proposed techniques in the automatic generation of semantic annotations.
