How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language

One of the factors that have hindered progress in sign language recognition, translation, and production is the absence of large annotated datasets. To address this, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset consisting of a parallel corpus of more than 80 hours of sign language videos together with a set of corresponding modalities, including speech, English transcripts, and depth. A three-hour subset was additionally recorded in the Panoptic studio, enabling detailed 3D pose estimation. To evaluate the potential of How2Sign for real-world impact, we conducted a study with ASL signers and show that videos synthesized using our dataset can indeed be understood. The study also offers insights into the challenges that computer vision must address in order to make progress in this field.