This paper deals with semantic segmentation of video streams. The proposed method aims to detect story segments in an episode of a TV drama. For reliable story segmentation, the challenge is to build a robust model capable of capturing the underlying consistent latent states corresponding to a story segment while handling frequent changes in low-level features. We address this challenge by utilizing the multiple channels inherent in a TV drama: a given episode can be disassembled into visual, sound, and textual channels. The proposed method builds a dynamic model for each modality channel and combines the resulting models. The difficulty with this approach is the difference in time granularity between channels. To resolve this difference, we introduce a hierarchical model in which a common latent state generates the scenes and dialogue of a story segment. Each dynamic model analyzes its own segments in the assigned channel, and at a higher level the composite likelihood of a story-segment change is estimated from each channel's estimate. We report preliminary estimation results in this paper.

1 Semantic segmentation in TV dramas and the proposed method

We discuss a semantic segmentation method for video streams. A method for partitioning a series of images into groups was introduced in [1], and video segmentation methods using prior knowledge or repetition in scenes are being actively researched [2]. In contrast to previous research, our method attempts to detect inherent story segments. If a story is defined as “a topically cohesive segment of episodes that includes multiple sentences and events about a single topic” (modified from the definition of [3]), a video stream can be interpreted as a set of stories. The difficulties associated with semantic segmentation are as follows: (a) no suitable method exists for semantic analysis, and (b) frequent changes in low-level features must be handled.
To address these challenges, we propose a composite scheme based on the hierarchical Dirichlet process (HDP) [4]. The proposed method builds a separate dynamic model for each of the image and sound channels in an episode of a TV drama. The image-channel dynamic model handles scene data (Fig. 1(a) and Eq. 1), and the sound-channel dynamic model analyzes the dialogue of each character in the video stream (Fig. 1(b) and Eq. 2). Each dynamic model is similar to the sticky HDP-HMM in [5]. The differing time granularity of the channels makes analysis of video streams difficult. This difficulty is alleviated by considering the likelihoods F(x_{Lt} | G_L) and F(S_{Lj} | G_L), where x_{Lt} is the t-th scene in story segment L, S_{Lj} is the j-th sound state, and F(x | θ) denotes the likelihood of x given θ. When

∗ This work was supported by the National Research Foundation of Korea grant funded by the Korean government (MEST) (No. 2011-0016483).
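As a minimal illustration of the higher-level combination step (not the authors' implementation; function names, the conditional-independence assumption across channels, and the logistic form are ours), the composite evidence for a story-segment change can be sketched as a sum of per-channel log-likelihood ratios folded into a prior over boundaries:

```python
import math

def channel_boundary_logratio(ll_boundary, ll_continue):
    """Log-likelihood ratio for a segment change in one channel,
    given that channel's log-likelihood under 'boundary here' vs.
    'segment continues'."""
    return ll_boundary - ll_continue

def composite_change_probability(channel_scores, prior_change=0.1):
    """Combine per-channel evidence for a story-segment change.

    channel_scores: list of log-likelihood ratios, one per channel
    (e.g. image and sound). Channels are treated as conditionally
    independent given the latent story state -- an assumption made
    for this sketch, so the log ratios simply add.
    """
    log_odds = math.log(prior_change / (1.0 - prior_change)) + sum(channel_scores)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values: the image channel weakly favors a boundary,
# the sound channel favors it more strongly.
scores = [channel_boundary_logratio(-10.2, -10.9),
          channel_boundary_logratio(-55.0, -58.5)]
p_change = composite_change_probability(scores, prior_change=0.1)
```

With no channel evidence (all log ratios zero), the composite probability reduces to the prior, which is a quick sanity check on the combination rule.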
References

[1] Chin-Hui Lee et al., “A Multi-Modal Approach to Story Segmentation for News Video,” World Wide Web, 2003.
[2] Michael I. Jordan et al., “Bayesian Nonparametric Inference of Switching Dynamic Linear Models,” IEEE Transactions on Signal Processing, 2010.
[3] Michael I. Jordan et al., “Hierarchical Dirichlet Processes,” 2006.
[4] Joachim M. Buhmann et al., “Bayesian Order-Adaptive Clustering for Video Segmentation,” EMMCVPR, 2007.
[5] David B. Dunson et al., “The Dynamic Hierarchical Dirichlet Process,” ICML '08, 2008.
[6] David G. Lowe et al., “Object Recognition from Local Scale-Invariant Features,” Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.