A framework for associated news story retrieval

Video retrieval, the task of searching for and retrieving videos relevant to a given query, is one of the most popular topics in both real-life applications and multimedia research. Finding relevant video content is important for producers of television news, documentaries and commercials. In the news domain in particular, numerous news agencies and media houses publish hundreds of news stories in many different languages every day. This volume poses enormous challenges for developing efficient retrieval techniques. One such challenge is identifying two news clips that discuss the same story. Two news stories may address the same main topic while differing visually, so the visual information need not be similar enough for simple near-duplicate video detection algorithms to work. We call such news stories associated news stories, and the main objective of this thesis is to identify them. It is therefore imperative to resort to other modalities, such as speech and text, for robust retrieval of associated news stories.

In the visual domain, associated news stories can be seen as duplicate, near-duplicate or partially near-duplicate videos or, in more challenging cases, as videos sharing specific visual concepts (e.g. fire, storm, strike). We study the Near-Duplicate Keyframe (NDK) identification task as the core of the visual analysis, using different global and local features such as the Scale-Invariant Feature Transform (SIFT). We propose the Constraint Symmetric Matching scheme to match SIFT descriptors between two keyframes, and also incorporate other features such as color to tackle the NDK detection task. Next, we cluster the keyframes within a news story that are NDKs of one another and generate a novel scene-level video signature, called a scene signature, for each NDK cluster.
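The abstract does not spell out the constraints used in the Constraint Symmetric Matching scheme, so the following is only a minimal sketch of the plain symmetric (mutual nearest-neighbour) matching idea that such a scheme presumably builds on. It assumes Euclidean distance and uses tiny 2-dimensional toy vectors in place of 128-dimensional SIFT descriptors; the function names `nearest` and `symmetric_matches` are hypothetical.

```python
import math


def nearest(query, candidates):
    """Return the index of the candidate closest to `query` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda j: math.dist(query, candidates[j]))


def symmetric_matches(desc_a, desc_b):
    """Return (i, j) pairs where desc_a[i] and desc_b[j] are mutual nearest neighbours.

    This is generic symmetric matching, not the thesis's constrained variant:
    a pair is kept only if the match holds in both directions.
    """
    if not desc_a or not desc_b:
        return []
    a_to_b = [nearest(d, desc_b) for d in desc_a]  # best match in B for each A descriptor
    b_to_a = [nearest(d, desc_a) for d in desc_b]  # best match in A for each B descriptor
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy descriptors standing in for the SIFT descriptors of two keyframes (assumed data).
frame_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
frame_b = [(0.1, 0.0), (5.0, 5.2), (20.0, 20.0)]

matches = symmetric_matches(frame_a, frame_b)
print(matches)
```

In this toy case the third descriptor of `frame_a` prefers the second descriptor of `frame_b`, but that preference is not returned, so the pair is rejected. Filtering out such one-sided matches is the purpose of the symmetry check.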
A scene signature is essentially a Bag-of-SIFT containing both the common and the distinct visual cues within an NDK cluster, and it is more compact and discriminative than the keyframe-level local feature representation. In addition to the scene signature, we generate a visual semantic signature for a news video: a 374-dimensional feature indicating the probability that each of a set of predefined visual concepts is present in a news story. We integrate these two sources of visual knowledge (i.e. the scene signature and the semantic signature) to determine an enhanced visual content similarity between two stories.

In the textual domain, associated news stories usually share spoken words (by the anchor or reporter) and/or displayed words (appearing as closed captions), which can be extracted through Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR), respectively. Since OCR transcripts usually have a high error rate, we propose a novel post-processing approach, based on the idea of a local dictionary, to recover the erroneous OCR output and identify the more informative words, called keywords. We generate an enhanced textual content representation from the ASR transcript and the OCR keywords through an early fusion scheme. We also employ textual semantic similarity to measure the relatedness of the textual features.

Finally, we combine the enhanced textual and visual representations through an early fusion scheme, and the corresponding similarities through a late fusion scheme, to investigate their complementary roles in the associated news story retrieval task. In the proposed early fusion, we retrieve visual semantics, represented by the visual semantic signature, using the textual information provided by ASR and OCR. In the late fusion, we combine the enhanced textual and visual content similarities and the early fusion similarity through a learning process to boost the retrieval performance.

We evaluate the proposed NDK retrieval, detection and clustering approaches in extensive experiments on standard datasets.
We also assess the effectiveness and compactness of the proposed scene signature as a video representation, comparing it with other local and global video signatures on a web video dataset. Finally, we show the usefulness of multi-modal approaches that use different textual and visual modalities to retrieve associated news stories.
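The late fusion step described above can be illustrated with a small sketch. The abstract says the similarities are combined "through a learning process" but does not name the learner, so this example simply assumes that the process has already produced one weight per similarity source and applies a normalised weighted sum. The weights, story identifiers and scores below are all hypothetical.

```python
def late_fusion(scores, weights):
    """Combine per-modality similarity scores into one score by a weighted average."""
    total = sum(weights[m] for m in scores)
    if total == 0:
        raise ValueError("weights for the given modalities sum to zero")
    return sum(weights[m] * scores[m] for m in scores) / total


def rank_candidates(candidates, weights):
    """Return candidate story ids ordered from most to least similar to the query."""
    fused = {story_id: late_fusion(s, weights) for story_id, s in candidates.items()}
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical weights; in the thesis these would come from the learning process.
weights = {"textual": 0.5, "visual": 0.3, "early_fusion": 0.2}

# Hypothetical similarities between one query story and three candidate stories.
candidates = {
    "story_a": {"textual": 0.9, "visual": 0.2, "early_fusion": 0.6},
    "story_b": {"textual": 0.4, "visual": 0.8, "early_fusion": 0.5},
    "story_c": {"textual": 0.1, "visual": 0.1, "early_fusion": 0.2},
}

ranking = rank_candidates(candidates, weights)
print(ranking)
```

Here `story_a` ranks first despite its low visual similarity, which reflects the motivating case of the abstract: two stories on the same topic that look different can still be retrieved through their textual evidence.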
