Rate-invariant comparisons of covariance paths for visual speech recognition

An important problem in speech recognition, and in activity recognition more generally, is to develop analyses that are invariant to execution rate. We introduce a theoretical framework that provides a parametrization-invariant metric for comparing parametrized paths on Riemannian manifolds. Treating instances of activities as parametrized paths on a Riemannian manifold of covariance matrices, we apply this framework to visual speech recognition from image sequences. We represent each sequence as a path on the space of covariance matrices, where each covariance matrix captures the spatial variability of the visual features in one frame, and we perform pairwise temporal alignment and comparison of paths simultaneously. This removes temporal variability and helps provide a robust metric for visual speech classification. We evaluate this idea on the OuluVS database, where temporal alignment improves the rank-1 nearest-neighbor classification rate from 32% to 57%.
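The pipeline described above (one covariance matrix per frame, then alignment and comparison of the resulting paths) can be illustrated with a minimal sketch. This is not the paper's parametrization-invariant framework: it swaps in plain dynamic time warping for the alignment step and assumes a log-Euclidean distance between covariance matrices. The function names, the synthetic features, and the small ridge term `eps` are all hypothetical choices made for illustration.

```python
import numpy as np

def frame_covariance(features, eps=1e-6):
    """Covariance of the feature vectors of one frame (rows = locations)."""
    centered = features - features.mean(axis=0)
    cov = centered.T @ centered / (len(features) - 1)
    # Assumption: a small ridge keeps the matrix strictly positive definite.
    return cov + eps * np.eye(cov.shape[0])

def spd_log(mat):
    """Matrix logarithm of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.log(vals)) @ vecs.T

def log_euclidean_distance(a, b):
    """Assumed per-frame distance; the paper may use a different metric."""
    return np.linalg.norm(spd_log(a) - spd_log(b), ord="fro")

def aligned_path_distance(path_a, path_b):
    """DTW cost between two covariance paths (stand-in for the paper's metric)."""
    n, m = len(path_a), len(path_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = log_euclidean_distance(path_a[i - 1], path_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def naive_path_distance(path_a, path_b):
    """Frame-by-frame comparison with no temporal alignment."""
    return sum(log_euclidean_distance(a, b) for a, b in zip(path_a, path_b))

# Synthetic example: six frames of 3-dimensional features at 50 locations.
rng = np.random.default_rng(0)
frames = [rng.normal(size=(50, 3)) * (1.0 + 0.2 * t) for t in range(6)]
path = [frame_covariance(f) for f in frames]

# The same activity executed more slowly at the start (first frames repeated).
slow = [path[0], path[0], path[1], path[1], path[2], path[2]] + path[3:]

aligned = aligned_path_distance(path, slow)
naive = naive_path_distance(path, slow)
print(f"aligned: {aligned:.4f}  naive: {naive:.4f}")
```

In this toy setting the aligned cost vanishes because the slower path visits the same covariance matrices in the same order, while the frame-by-frame cost does not. That gap is the kind of rate-induced variability that the temporal alignment in the abstract is meant to remove.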
