Unsupervised Alignment of News Video and Text Using Visual Patterns and Textual Concepts

A brief preview of a news video can be generated by semantically aligning the sentences of the anchor's summary report with the visual field shots. Because accurately detecting the objects in a visual shot is difficult and a textual term may have several synonyms, aligning an anchor sentence with a video shot remains challenging. In this study, the temporal relation among the frames in a visual shot is characterized by a visual language model, and this language-model-based temporal relation is then applied to sentence-based alignment. First, the bag-of-words representations of the main objects in the key frames of a visual shot are mapped to visual patterns trained from the news video database. Next, the textual terms in the report sentence are mapped to textual concepts obtained from the HowNet knowledge base. Finally, unsupervised alignment between the textual concepts and the visual patterns in the news videos is performed using IBM Model 1. In the evaluation, the visual pattern language model achieves an alignment score of 0.77, exceeding the 0.66 obtained with the dynamic time warping (DTW) method. Broken down by news category, visual pattern discovery and textual concept discovery improve the alignment performance in most categories.
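
The final alignment step can be illustrated with a minimal sketch of IBM Model 1 trained by expectation-maximization. The sketch rests on several assumptions that the abstract does not confirm: each shot is already reduced to a list of visual pattern labels, each sentence to a list of textual concept labels, the visual patterns are treated as the source side and the textual concepts as the target side, and the NULL source token of the standard model is omitted for brevity. The names `train_model1`, `align`, and `toy_pairs`, as well as all labels in the toy data, are hypothetical.

```python
from collections import defaultdict


def train_model1(pairs, iterations=20):
    """EM estimate of t(concept | pattern) for IBM Model 1 (no NULL token).

    pairs: list of (shot, sentence), where shot is a list of visual pattern
    labels and sentence is a list of textual concept labels.
    """
    concepts = {c for _, sentence in pairs for c in sentence}
    # Uniform start over every co-occurring (concept, pattern) pair.
    t = {(c, p): 1.0 / len(concepts)
         for shot, sentence in pairs for c in sentence for p in shot}
    for _ in range(iterations):
        count = defaultdict(float)  # expected (concept, pattern) counts
        total = defaultdict(float)  # expected counts per pattern
        # E-step: spread each concept over the patterns of its paired shot.
        for shot, sentence in pairs:
            for c in sentence:
                norm = sum(t[(c, p)] for p in shot)
                for p in shot:
                    frac = t[(c, p)] / norm
                    count[(c, p)] += frac
                    total[p] += frac
        # M-step: renormalize so that sum over c of t(c | p) equals 1.
        t = {(c, p): v / total[p] for (c, p), v in count.items()}
    return t


def align(t, shot, sentence):
    """Link each concept to the pattern in the shot with the highest t."""
    return [(c, max(shot, key=lambda p: t.get((c, p), 0.0))) for c in sentence]


# Hypothetical toy data: three shot/sentence pairs, no alignment labels.
toy_pairs = [
    (["vp_podium", "vp_crowd"], ["speech", "people"]),
    (["vp_podium"], ["speech"]),
    (["vp_crowd", "vp_car"], ["people", "vehicle"]),
]
t = train_model1(toy_pairs)
for shot, sentence in toy_pairs:
    print(align(t, shot, sentence))
```

No alignment labels are supplied: the co-occurrence statistics alone pull each concept toward the pattern it most consistently appears with, which is the sense in which the alignment is unsupervised.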
