Comparing Visual, Textual, and Multimodal Features for Detecting Sign Language in Video Sharing Sites

Easy recording and sharing of video content has led to the creation and distribution of increasing quantities of sign language (SL) content. Current capabilities make locating SL videos on a desired topic dependent on the existence and correctness of metadata indicating both the language and the topic of the video. Automated techniques for detecting sign language content can help address this problem. This paper compares metadata-based classifiers and multimodal classifiers, using both early and late fusion techniques, with video content-based classifiers from the literature. A comparison of TF-IDF, LDA, and NMF for generating metadata features indicates that NMF performs best, whether used independently or combined with video features. Multimodal classifiers outperform unimodal SL video classifiers: experiments show multimodal features achieving up to 86% precision, 81% recall, and an 84% F1 score. This represents an improvement in F1 score of roughly 9% over the video-based approach from the literature and of 6% over text-based features extracted using NMF.
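
To make the comparison concrete, the sketch below illustrates the two fusion strategies named above using scikit-learn: TF-IDF weights factorized by NMF serve as metadata features, early fusion concatenates text and video features before training one SVM, and late fusion averages Platt-calibrated probabilities from per-modality SVMs. The toy titles, the 64-dimensional stand-in video features, and the classifier settings are illustrative assumptions, not the authors' exact pipeline.

```python
# A minimal sketch of NMF metadata features with early and late fusion,
# assuming scikit-learn. Data below is a synthetic placeholder.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy metadata text and labels (1 = sign language video).
sl_titles = ["asl vlog deaf community signing", "sign language story asl"]
other_titles = ["cooking tutorial easy pasta", "gaming highlights funny moments"]
titles = sl_titles * 10 + other_titles * 10
labels = np.array([1] * 20 + [0] * 20)

# Stand-in visual features (e.g., per-video motion descriptors).
video_features = rng.random((len(labels), 64))

# Text features: TF-IDF weights factorized by NMF into topic activations.
tfidf = TfidfVectorizer(stop_words="english")
X_text = tfidf.fit_transform(titles)
nmf = NMF(n_components=4, init="nndsvd", random_state=0)
W = nmf.fit_transform(X_text)  # per-video topic weights

# Early fusion: concatenate both modalities into one feature vector.
X_early = np.hstack([W, video_features])
clf_early = SVC(probability=True, random_state=0).fit(X_early, labels)

# Late fusion: per-modality SVMs whose Platt-calibrated probabilities
# are averaged before thresholding.
clf_text = SVC(probability=True, random_state=0).fit(W, labels)
clf_video = SVC(probability=True, random_state=0).fit(video_features, labels)
p_sl = (clf_text.predict_proba(W)[:, 1]
        + clf_video.predict_proba(video_features)[:, 1]) / 2
pred_late = (p_sl >= 0.5).astype(int)
print("late-fusion training accuracy:", (pred_late == labels).mean())
```

In practice the fused and per-modality classifiers would be evaluated on held-out videos rather than the training set; the averaging weight in late fusion is another tunable choice.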
