LRS3-TED: a large-scale dataset for visual speech recognition

This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It consists of face tracks from over 400 hours of TED and TEDx videos, together with the corresponding subtitles and word alignment boundaries. The dataset is substantially larger in scale than other public datasets available for general research.
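
To make the per-clip annotations concrete, the sketch below parses a hypothetical transcript file holding the subtitle text and per-word start/end boundaries for one face-track clip. The "Text:" header, column layout, and time units are assumptions chosen for illustration, not the dataset's documented on-disk format.

```python
# A minimal sketch (assumed format, not the dataset's documented layout): parse a
# hypothetical per-clip annotation file that stores the subtitle text on a "Text:"
# line and one "WORD START END" line per aligned word, with times in seconds.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WordAlignment:
    word: str
    start: float  # assumed: seconds from the start of the face-track clip
    end: float


def parse_annotation(raw: str) -> Tuple[str, List[WordAlignment]]:
    """Return the subtitle text and the word alignment boundaries for one clip."""
    transcript = ""
    words: List[WordAlignment] = []
    for line in raw.splitlines():
        if line.startswith("Text:"):
            transcript = line.split(":", 1)[1].strip()
            continue
        parts = line.split()
        if len(parts) >= 3:
            try:
                words.append(WordAlignment(parts[0], float(parts[1]), float(parts[2])))
            except ValueError:
                pass  # skip any non-alignment metadata lines
    return transcript, words


# Usage with an invented example annotation:
example = """Text: THANK YOU SO MUCH
THANK 0.00 0.21
YOU 0.21 0.34
SO 0.34 0.52
MUCH 0.52 0.80"""
text, alignments = parse_annotation(example)
print(text)            # THANK YOU SO MUCH
print(alignments[0])   # WordAlignment(word='THANK', start=0.0, end=0.21)
```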
