Paper information - Speech technology plays a key role in video semantic indexing

Speech technology plays a key role in video semantic indexing

Video semantic indexing is a core task in content-based video retrieval (CBVR): a user submits a text query for an object or a scene to a search system, and the system returns the video shots that contain it. We introduce an emerging framework for this task that relies heavily on statistical speaker verification and adaptation techniques. It uses Gaussian mixture model (GMM) supervectors and support vector machines (SVMs) to robustly detect a large variety of objects and scenes in video. It showed excellent performance in the Semantic Indexing task of the TRECVID 2011 workshop, where a large archive of consumer-produced Internet videos was used for evaluation.
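To make the GMM supervector idea concrete, here is a minimal sketch of how a supervector is commonly built in speaker verification: the means of a background GMM are MAP-adapted to the feature vectors of one item (here, one video shot), then scaled and concatenated. This is not the system described in the abstract. It assumes standard relevance-MAP adaptation of the means only, diagonal covariances, and the usual weight and variance scaling used with linear SVM kernels. The function names, the relevance factor `tau`, and the toy data are all hypothetical.

```python
import numpy as np


def log_gaussian_diag(x, means, variances):
    """Log density of every frame under every diagonal-covariance Gaussian.

    x: (T, D) frames; means, variances: (K, D). Returns an array of shape (T, K).
    """
    diff = x[:, None, :] - means[None, :, :]            # (T, K, D)
    mahalanobis = np.sum(diff ** 2 / variances, axis=2)  # (T, K)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    return -0.5 * (mahalanobis + log_norm)


def map_adapt_means(x, weights, means, variances, tau=20.0):
    """Relevance-MAP adaptation of the GMM means to the frames in x.

    tau is an assumed relevance factor: larger values keep the adapted
    means closer to the background-model means.
    """
    log_p = np.log(weights) + log_gaussian_diag(x, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)   # stabilise before exponentiating
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)     # component posteriors, (T, K)
    n_k = post.sum(axis=0)                      # soft counts per component, (K,)
    first_order = post.T @ x                    # posterior-weighted sums, (K, D)
    return (first_order + tau * means) / (n_k[:, None] + tau)


def supervector(adapted_means, weights, variances):
    """Concatenate the adapted means, scaled by sqrt(weight) / std per component."""
    scaled = np.sqrt(weights)[:, None] * adapted_means / np.sqrt(variances)
    return scaled.ravel()


# Toy background model (K=2 components, D=3 dimensions) and random "shot" features.
rng = np.random.default_rng(0)
ubm_weights = np.array([0.4, 0.6])
ubm_means = rng.normal(size=(2, 3))
ubm_variances = np.ones((2, 3))
shot_features = rng.normal(size=(50, 3))

adapted = map_adapt_means(shot_features, ubm_weights, ubm_means, ubm_variances)
sv = supervector(adapted, ubm_weights, ubm_variances)
print(sv.shape)
```

Each shot yields one fixed-length vector of K x D values regardless of how many feature vectors it contains, which is what allows a standard SVM to be trained per object or scene category on top of these vectors.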

Koichi Shinoda
