The Jester Dataset: A Large-Scale Video Dataset of Human Gestures

Gesture recognition and its applications in human-computer interfaces have grown increasingly popular in recent years. Although many gestures can be recognized from a single image frame, building a responsive, accurate system that can distinguish complex gestures with subtle differences between them requires large-scale, real-world video datasets. In this work, we introduce the largest collection of short video clips of humans performing gestures in front of a camera. The dataset was collected with the help of over 1300 different actors in their own unconstrained environments. Additionally, we present an ongoing gesture recognition challenge based on our dataset, along with its current results. We also describe how a baseline achieving over 93% recognition accuracy can be obtained with a simple 3D convolutional neural network.
