Fast unsupervised ego-action learning for first-person sports videos

Portable high-quality sports cameras (e.g., head- or helmet-mounted) built for recording dynamic first-person video are becoming common among sports enthusiasts. We address the novel task of discovering first-person action categories (which we call ego-actions), which is useful for tasks such as video indexing and retrieval. To learn ego-action categories, we investigate the use of motion-based histograms and unsupervised learning algorithms to quickly cluster video content. Our approach assumes a completely unsupervised scenario: labeled training videos are not available, videos are not pre-segmented, and the number of ego-action categories is unknown. In our proposed framework, we show that a stacked Dirichlet process mixture model can be used to automatically learn both a motion histogram codebook and the set of ego-action categories. We quantitatively evaluate our approach on both in-house and public YouTube videos and demonstrate robust ego-action categorization across several sports genres. Comparative analysis shows that our approach outperforms other state-of-the-art topic models in both classification accuracy and computational speed. Preliminary results indicate that, on average, the categorical content of a 10-minute video sequence can be indexed in under 5 seconds.
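
As a rough illustration of the stacked pipeline described above, the sketch below clusters per-frame optical-flow histograms into a motion codebook (stage one), then clusters per-window codeword histograms into ego-action categories (stage two). This is a minimal sketch, not the paper's implementation: it substitutes scikit-learn's BayesianGaussianMixture with a Dirichlet process prior for the paper's DPMM inference and Farnebäck optical flow for the paper's motion features, and all function names, parameter values, and bin ranges are illustrative assumptions.

```python
# Hypothetical two-stage ("stacked") Dirichlet process mixture sketch,
# not the paper's exact method or parameters.
import cv2
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def flow_histograms(video_path, mag_bins=4, ang_bins=8):
    """Per-frame histogram over optical-flow magnitude x direction."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    assert ok, "could not read video"
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    hists = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        # Joint magnitude/direction histogram, L1-normalized per frame.
        h, _, _ = np.histogram2d(mag.ravel(), ang.ravel(),
                                 bins=(mag_bins, ang_bins),
                                 range=((0, 20), (0, 2 * np.pi)))
        hists.append(h.ravel() / max(h.sum(), 1.0))
        prev = gray
    cap.release()
    return np.array(hists)

def stacked_dpmm(hists, window=30, max_words=20, max_actions=10):
    # Stage 1: DP mixture over frame histograms -> motion codebook.
    # n_components is only a truncation level; the DP prior prunes
    # unused components, so the effective codebook size is learned.
    dp_words = BayesianGaussianMixture(
        n_components=max_words,
        weight_concentration_prior_type="dirichlet_process")
    words = dp_words.fit_predict(hists)
    # Bag-of-codewords histogram for each temporal window.
    bow = np.array([np.bincount(words[i:i + window], minlength=max_words)
                    for i in range(0, len(words) - window + 1, window)],
                   dtype=float)
    bow /= bow.sum(axis=1, keepdims=True)
    # Stage 2: DP mixture over window histograms -> ego-action labels.
    dp_actions = BayesianGaussianMixture(
        n_components=max_actions,
        weight_concentration_prior_type="dirichlet_process")
    return dp_actions.fit_predict(bow)
```

A call such as `stacked_dpmm(flow_histograms("run.mp4"))` (file name hypothetical) would return one ego-action label per temporal window; because neither stage needs labeled data or a preset category count, the sketch mirrors the unsupervised setting assumed above.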
