Rate-Invariant Analysis of Trajectories on Riemannian Manifolds with Application in Visual Speech Recognition

In the statistical analysis of video sequences for speech recognition, and more generally activity recognition, it is natural to treat the temporal evolution of features as trajectories on Riemannian manifolds. However, different evolution patterns result in arbitrary parameterizations of these trajectories. We investigate a recent framework from the statistics literature that handles this nuisance variability using a single cost function, which is also a distance, for temporal registration and for statistical summarization and modeling of trajectories. The framework is based on a mathematical representation of trajectories, termed the transported square-root vector field (TSRVF), and on the L2 norm on the space of TSRVFs. We apply this framework to the problem of speech recognition using both audio and visual components. In each case, we extract features, form trajectories on the corresponding manifolds, and compute parameterization-invariant distances using TSRVFs for speech classification. On the OuluVS database, classification performance under this metric increases significantly, by nearly 100%, under both modalities and for all choices of features. We obtain speaker-dependent classification rates of 70% and 96% for the visual and audio components, respectively.
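
The invariance property that motivates the TSRVF representation can be illustrated with a minimal numerical sketch. It assumes the simplest possible setting, trajectories in flat Euclidean space, where parallel transport to a reference point is the identity and the TSRVF reduces to the velocity divided by the square root of the speed. The function names `srvf` and `l2_distance`, the two test curves, and the warping function `gamma` are illustrative choices, not taken from the paper, and the sketch does not implement the manifold-valued case or the temporal registration step.

```python
import numpy as np

def srvf(alpha, t):
    """Square-root velocity field of a trajectory alpha (shape T x n) sampled at times t.

    Assumption: the trajectory lives in flat R^n, so no parallel transport is needed.
    """
    vel = np.gradient(alpha, t, axis=0)            # finite-difference velocity
    speed = np.linalg.norm(vel, axis=1)
    return vel / np.sqrt(np.maximum(speed, 1e-12))[:, None]

def l2_distance(q1, q2, t):
    """L2 distance between two vector fields sampled on the grid t (trapezoid rule)."""
    f = np.sum((q1 - q2) ** 2, axis=1)
    return np.sqrt(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t)))

# Two illustrative planar trajectories on [0, 1].
def curve1(s):
    return np.stack([s, np.sin(2 * np.pi * s)], axis=1)

def curve2(s):
    return np.stack([s, 0.5 * np.sin(2 * np.pi * s)], axis=1)

t = np.linspace(0.0, 1.0, 2001)

# A smooth warping of [0, 1] with strictly positive derivative (a change of execution rate).
gamma = t + 0.3 * t * (1.0 - t)

# Distance between the original trajectories.
d_before = l2_distance(srvf(curve1(t), t), srvf(curve2(t), t), t)

# Distance after both trajectories are re-parameterized by the same warping.
d_after = l2_distance(srvf(curve1(gamma), t), srvf(curve2(gamma), t), t)

print(d_before, d_after)
```

Up to discretization error, the two printed distances should agree: applying the same re-parameterization to both trajectories leaves the L2 distance between their square-root fields unchanged. This is the property that allows a single quantity to serve for both registration and comparison.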