Online Early-Late Fusion Based on Adaptive HMM for Sign Language Recognition

In sign language recognition (SLR) with multimodal data, a sign word can be represented by multiply features, for which there exist an intrinsic property and a mutually complementary relationship among them. To fully explore those relationships, we propose an online early-late fusion method based on the adaptive Hidden Markov Model (HMM). In terms of the intrinsic property, we discover that inherent latent change states of each sign are related not only to the number of key gestures and body poses but also to their translation relationships. We propose an adaptive HMM method to obtain the hidden state number of each sign by affinity propagation clustering. For the complementary relationship, we propose an online early-late fusion scheme. The early fusion (feature fusion) is dedicated to preserving useful information to achieve a better complementary score, while the late fusion (score fusion) uncovers the significance of those features and aggregates them in a weighting manner. Different from classical fusion methods, the fusion is query adaptive. For different queries, after feature selection (including the combined feature), the fusion weight is inversely proportional to the area under the curve of the normalized query score list for each selected feature. The whole fusion process is effective and efficient. Experiments verify the effectiveness on the signer-independent SLR with large vocabulary. Compared either on different dataset sizes or to different SLR models, our method demonstrates consistent and promising performance.

[1]  Xi Wang,et al.  Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification , 2016, ACM Multimedia.

[2]  Changsheng Xu,et al.  Discriminative Exemplar Coding for Sign Language Recognition With Kinect , 2013, IEEE Transactions on Cybernetics.

[3]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Qi Tian,et al.  Query-adaptive late fusion for image search and person re-identification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Chao Xie,et al.  Chinese sign language recognition with adaptive HMM , 2016, 2016 IEEE International Conference on Multimedia and Expo (ICME).

[6]  Philip Chan,et al.  Toward accurate dynamic time warping in linear time and space , 2007, Intell. Data Anal..

[7]  Houqiang Li,et al.  Sign language recognition with long short-term memory , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[8]  Ernest Valveny,et al.  Optimal Classifier Fusion in a Non-Bayesian Probabilistic Framework , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Gang Hua,et al.  Can Visual Recognition Benefit from Auxiliary Information in Training? , 2014, ACCV.

[10]  Bingbing Ni,et al.  First-Person Daily Activity Recognition With Manipulated Object Proposals and Non-Linear Feature Fusion , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[11]  Ming Yang,et al.  Query Specific Fusion for Image Retrieval , 2012, ECCV.

[12]  Andrew Zisserman,et al.  Large-scale Learning of Sign Language by Watching TV (Using Co-occurrences) , 2013, BMVC.

[13]  Meng Wang,et al.  Unified Video Annotation via Multigraph Learning , 2009, IEEE Transactions on Circuits and Systems for Video Technology.

[14]  Jiri Matas,et al.  On Combining Classifiers , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[15]  Gang Hua,et al.  Multi-View Visual Recognition of Imperfect Testing Data , 2015, ACM Multimedia.

[16]  Oscar Koller,et al.  Using Convolutional 3D Neural Networks for User-independent continuous gesture recognition , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[17]  Jun Wang,et al.  Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification , 2014, ACM Multimedia.

[18]  Junsong Yuan,et al.  Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera , 2011, ACM Multimedia.

[19]  Giulio Paci,et al.  A Multi-scale Approach to Gesture Detection and Recognition , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[20]  Yanning Zhang,et al.  Convolutional Neural Network-Based Robot Navigation Using Uncalibrated Spherical Images , 2017, Sensors.

[21]  Jitendra Malik,et al.  Color- and texture-based image segmentation using EM and its application to content-based image retrieval , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[22]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Changsheng Xu,et al.  Semantic Feature Mining for Video Event Understanding , 2016, ACM Trans. Multim. Comput. Commun. Appl..

[24]  Pavlo Molchanov,et al.  Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Helena M. Mentis,et al.  Instructing people for training gestural interactive systems , 2012, CHI.

[26]  Meng Wang,et al.  Sign language recognition based on adaptive HMMS with data augmentation , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[27]  Fahad Shahbaz Khan,et al.  Color attributes for object detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Meng Wang,et al.  Multimodal Graph-Based Reranking for Web Image Search , 2012, IEEE Transactions on Image Processing.

[29]  Kien A. Hua,et al.  A Temporal Order Modeling Approach to Human Action Recognition from Multimodal Sensor Data , 2017, ACM Trans. Multim. Comput. Commun. Appl..

[30]  Qi Tian,et al.  Scalable Object Retrieval with Compact Image Representation from Generic Object Regions , 2015, ACM Trans. Multim. Comput. Commun. Appl..

[31]  Stan Sclaroff,et al.  A Unified Framework for Gesture Recognition and Spatiotemporal Gesture Segmentation , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Tarik Arici,et al.  Gesture Recognition using Skeleton Data with Weighted Dynamic Time Warping , 2013, VISAPP.

[33]  Xin Zhao,et al.  Structured Streaming Skeleton -- A New Feature for Online Human Gesture Recognition , 2014, TOMM.

[34]  Changsheng Xu,et al.  Latent Support Vector Machine Modeling for Sign Language Recognition with Kinect , 2015, ACM Trans. Intell. Syst. Technol..

[35]  Sander Dieleman,et al.  Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video , 2015, International Journal of Computer Vision.

[36]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[37]  Delbert Dueck,et al.  Clustering by Passing Messages Between Data Points , 2007, Science.

[38]  Sergio Escalera,et al.  ChaLearn Looking at People RGB-D Isolated and Continuous Datasets for Gesture Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[39]  Fahad Shahbaz Khan,et al.  Modulating Shape Features by Color Attention for Object Recognition , 2012, International Journal of Computer Vision.

[40]  Xilin Chen,et al.  Fast sign language recognition benefited from low rank approximation , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[41]  Ling Shao,et al.  Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Lei Wu,et al.  Effective Active Skeleton Representation for Low Latency Human Action Recognition , 2016, IEEE Transactions on Multimedia.

[43]  Christian Wolf,et al.  ModDrop: Adaptive Multi-Modal Gesture Recognition , 2014, IEEE Trans. Pattern Anal. Mach. Intell..

[44]  Z. Liu,et al.  A real time system for dynamic hand gesture recognition with a depth sensor , 2012, 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO).

[45]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[46]  Yi-Ping Hung,et al.  Recognizing Human Actions with Outlier Frames by Observation Filtering and Completion , 2017, ACM Trans. Multim. Comput. Commun. Appl..

[47]  Xilin Chen,et al.  Curve Matching from the View of Manifold for Sign Language Recognition , 2014, ACCV Workshops.

[48]  Lu Yang,et al.  Survey on 3D Hand Gesture Recognition , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[49]  Houqiang Li,et al.  Sign Language Recognition using 3D convolutional neural networks , 2015, 2015 IEEE International Conference on Multimedia and Expo (ICME).

[50]  Fuchun Sun,et al.  HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Wei Liu,et al.  Noise resistant graph ranking for improved web image search , 2011, CVPR 2011.

[52]  Anil K. Jain,et al.  Likelihood Ratio-Based Biometric Score Fusion , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53]  Arun Ross,et al.  Score normalization in multimodal biometric systems , 2005, Pattern Recognit..

[54]  Qi Tian,et al.  Uniting Keypoints: Local Visual Information Fusion for Large-Scale Image Search , 2015, IEEE Transactions on Multimedia.

[55]  Sergio Escalera,et al.  ChaLearn Looking at People Challenge 2014: Dataset and Results , 2014, ECCV Workshops.