Feature selection for unsupervised discovery of statistical temporal structures in video

In this paper, we present algorithms for automatic feature selection for structure discovery from video sequences. Feature selection in this scenario is hard for two reasons: there are no class labels to evaluate against, and the temporal correlation among samples prevents direct estimation of the posterior probability of a cluster given the sequence. The overall problem of structure discovery is formulated as simultaneously finding statistical descriptions of structure and locating the segments that match those descriptions. Under Markov assumptions among events, structures in the video are modelled with hierarchical hidden Markov models, with efficient algorithms to jointly learn the model parameters and the optimal model complexity. Feature selection iterates between a wrapper step that partitions the large feature pool into consistent subsets and a filter step that eliminates redundancy within each subset. The feature subsets are then ranked according to the normalized Bayesian Information Criterion, and the learning results from these ranked subsets can be evaluated and interpreted by a human observer. Results on soccer and baseball videos show that the automatically selected feature set coincides with the one selected using domain knowledge and intuition, while achieving a correspondence with manually labelled ground truth comparable to that of supervised learning.
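
The ranking step can be illustrated with a minimal sketch. It assumes the plain Bayesian Information Criterion in its "higher is better" form (log-likelihood minus a complexity penalty); the normalization used in the paper is not described in this abstract and is omitted here. The function names, the subset names, and the numbers in the usage example are hypothetical.

```python
import math


def bic_score(log_likelihood, num_params, num_samples):
    """Plain BIC, higher is better: data fit minus a model-complexity penalty.

    Assumption: this is the textbook criterion, not the paper's normalized variant.
    """
    return log_likelihood - 0.5 * num_params * math.log(num_samples)


def rank_subsets(candidates, num_samples):
    """Rank candidate feature subsets from best to worst BIC.

    candidates maps a subset name to (log_likelihood, num_params) of the
    model learned on that subset.
    """
    scores = {
        name: bic_score(log_lik, k, num_samples)
        for name, (log_lik, k) in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical values, for illustration only.
ranking = rank_subsets(
    {"subset_a": (-100.0, 10), "subset_b": (-90.0, 50)},
    num_samples=1000,
)
print(ranking)
```

In this made-up example the second subset fits the data slightly better but uses five times as many parameters, so the penalty term places it below the first.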