Unsupervised Discovery of Action Hierarchies in Large Collections of Activity Videos

Given a large collection of videos containing activities, we investigate the problem of organizing it, in an unsupervised fashion, into a hierarchy based on the similarity of the actions embedded in the videos. We use spatio-temporal volumes of filtered motion vectors to compute appearance-invariant action similarity measures efficiently, and we use these measures in hierarchical agglomerative clustering to organize the videos into a hierarchy in which neighboring nodes contain similar actions. This naturally leads to a simple automatic scheme for selecting videos of representative actions (exemplars) from the database and for efficiently indexing the whole database. We compute a performance metric on the hierarchical structure to evaluate the quality of the estimated hierarchy, and show that this metric has potential for predicting the clustering performance of the various joining criteria used in building hierarchies. Our results show that perceptually meaningful hierarchies can be constructed from action similarities with minimal user supervision, while providing favorable clustering and retrieval performance.
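The pipeline summarized above can be illustrated with a minimal sketch. It assumes a precomputed, symmetric dissimilarity matrix between videos; the toy values below are hypothetical, and the motion-vector similarity computation itself is not reproduced. The function names `agglomerate`, `exemplar`, and `cophenetic_correlation` are illustrative. The joining criteria shown (single, complete, average linkage) are common choices and not necessarily the ones evaluated here. The abstract does not name its performance metric, so the cophenetic correlation, a standard measure of how faithfully a hierarchy preserves the input dissimilarities, is used only as a stand-in, and the medoid rule for picking an exemplar is likewise an assumption.

```python
from itertools import combinations


def agglomerate(dist, linkage="average"):
    """Naive agglomerative clustering over a symmetric dissimilarity matrix.

    Returns the merges in the order performed, each as
    (members_of_left_cluster, members_of_right_cluster, joining_distance).
    """
    clusters = {i: [i] for i in range(len(dist))}
    next_id = len(dist)
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            pair = [dist[i][j] for i in clusters[a] for j in clusters[b]]
            if linkage == "single":
                d = min(pair)
            elif linkage == "complete":
                d = max(pair)
            else:  # average linkage
                d = sum(pair) / len(pair)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


def exemplar(members, dist):
    """Assumed exemplar rule: the medoid, i.e. the member with the smallest
    total dissimilarity to the other members of its cluster."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


def cophenetic_correlation(dist, merges):
    """Pearson correlation between the input dissimilarities and the
    heights at which each pair of items is first joined in the hierarchy."""
    n = len(dist)
    coph = [[0.0] * n for _ in range(n)]
    for left, right, d in merges:
        for i in left:
            for j in right:
                coph[i][j] = coph[j][i] = d
    pairs = list(combinations(range(n), 2))
    x = [dist[i][j] for i, j in pairs]
    y = [coph[i][j] for i, j in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical dissimilarities between four videos: 0 and 1 show one
# action, 2 and 3 show another.
dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
merges = agglomerate(dist, linkage="average")
score = cophenetic_correlation(dist, merges)
root_exemplar = exemplar([0, 1, 2, 3], dist)
```

Running `agglomerate` with each linkage option and comparing the resulting scores gives a rough picture of how a hierarchy-level metric could be used to rank joining criteria. The naive pair search is cubic or worse in the number of videos, so a large collection would need an optimized library implementation instead.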
