Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition

The goal of this work is background-robust continuous sign language recognition (CSLR). Most existing CSLR benchmarks are filmed in studios against a static, monochromatic background. In the real world, however, signing is not confined to studios. To analyze the robustness of CSLR models under background shifts, we first evaluate existing state-of-the-art CSLR models on diverse backgrounds. To synthesize sign videos with a variety of backgrounds, we propose a pipeline that automatically generates a benchmark dataset from existing CSLR benchmarks. The newly constructed benchmark dataset consists of diverse scenes that simulate a real-world environment. We observe that even the most recent CSLR method cannot recognize glosses well on our new dataset with changed backgrounds. To address this, we also propose a simple yet effective training scheme for CSLR models that includes (1) background randomization and (2) feature disentanglement. Experimental results on our dataset demonstrate that our method generalizes well to unseen background data with minimal additional training images.
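As an illustration of the kind of background replacement described above, the following is a minimal sketch and not the pipeline used in this work. It assumes that a per-frame signer mask is already available (how such a mask is obtained is not specified here) and that background images have the same height and width as the video frames. The function names `composite_background` and `randomize_background` are hypothetical.

```python
import numpy as np


def composite_background(frames, masks, background):
    """Alpha-blend a signer onto a new background (illustrative sketch).

    frames:     (T, H, W, 3) uint8 video frames
    masks:      (T, H, W) floats in [0, 1]; 1 marks signer pixels (assumed given)
    background: (H, W, 3) uint8 image, same size as the frames (assumption)
    """
    alpha = masks[..., None].astype(np.float32)      # (T, H, W, 1)
    fg = frames.astype(np.float32)
    bg = background.astype(np.float32)[None]         # broadcast over time
    out = alpha * fg + (1.0 - alpha) * bg
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def randomize_background(frames, masks, backgrounds, rng):
    """Pick one background at random and apply it to every frame of a clip."""
    idx = rng.integers(len(backgrounds))
    return composite_background(frames, masks, backgrounds[idx])
```

In this sketch a single background is drawn per clip so that the scene stays consistent across frames; drawing a new background per frame would be an equally simple variant.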
