Unsupervised Random Forest Manifold Alignment for Lipreading

Lip reading from visual channels remains a challenging topic considering the various speaking characteristics. In this paper, we address an efficient lip reading approach by investigating the unsupervised random forest manifold alignment (RFMA). The density random forest is employed to estimate affinity of patch trajectories in speaking facial videos. We propose novel criteria for node splitting to avoid the rank-deficiency in learning density forests. By virtue of the hierarchical structure of random forests, the trajectory affinities are measured efficiently, which are used to find embeddings of the speaking video clips by a graph-based algorithm. Lip reading is formulated as matching between manifolds of query and reference video clips. We employ the manifold alignment technique for matching, where the L∞-norm-based manifold-to-manifold distance is proposed to find the matching pairs. We apply this random forest manifold alignment technique to various video data sets captured by consumer cameras. The experiments demonstrate that lip reading can be performed effectively, and outperform state-of-the-arts.

[1]  Luc Van Gool,et al.  Hough Forests for Object Detection, Tracking, and Action Recognition , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Stephen J. Cox,et al.  The challenge of multispeaker lip-reading , 2008, AVSP.

[3]  Trevor Darrell,et al.  Visual speech recognition with loosely synchronized feature streams , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[4]  Timothy F. Cootes,et al.  Active Appearance Models , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[6]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[7]  Matti Pietikäinen,et al.  Lipreading: A Graph Embedding Approach , 2010, 2010 20th International Conference on Pattern Recognition.

[8]  Matti Pietikäinen,et al.  Towards a practical lipreading system , 2011, CVPR 2011.

[9]  Hongbin Zha,et al.  Unsupervised Image Matching Based on Manifold Alignment , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Ronald R. Coifman,et al.  Data Fusion and Multicue Data Matching by Diffusion Maps , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Matti Pietikäinen,et al.  A comparative study of texture measures with classification based on featured distributions , 1996, Pattern Recognit..

[12]  Ron Kimmel,et al.  Representation Analysis and Synthesis of Lip Images Using Dimensionality Reduction , 2006, International Journal of Computer Vision.

[13]  Antonio Criminisi,et al.  Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning , 2012, Found. Trends Comput. Graph. Vis..

[14]  Nicholas Roy,et al.  Global A-Optimal Robot Exploration in SLAM , 2005, Proceedings of the 2005 IEEE International Conference on Robotics and Automation.

[15]  Yousef Saad,et al.  Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection , 2009, J. Mach. Learn. Res..

[16]  Matti Pietikäinen,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON MULTIMEDIA 1 Lipreading with Local Spatiotemporal Descriptors , 2022 .

[17]  Daniel Rueckert,et al.  Random Forest-Based Manifold Learning for Classification of Imaging Data in Dementia , 2011, MLMI.

[18]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[19]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[20]  Kai Li,et al.  Efficient k-nearest neighbor graph construction for generic similarity measures , 2011, WWW.

[21]  Gang Yu,et al.  Unsupervised random forest indexing for fast action search , 2011, CVPR 2011.

[22]  Timothy F. Cootes,et al.  Extraction of Visual Features for Lipreading , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[23]  Daniel D. Lee,et al.  Semisupervised alignment of manifolds , 2005, AISTATS.

[24]  Chalapathy Neti,et al.  Recent advances in the automatic recognition of audiovisual speech , 2003, Proc. IEEE.

[25]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).