PRAV: A Phonetically Rich Audio Visual Corpus

This paper describes the acquisition of PRAV, a phonetically rich audio-visual corpus. The PRAV Corpus contains audio and visual recordings of 2368 sentences from the TIMIT corpus, each spoken by four subjects, making it the largest audio-visual corpus in the literature in terms of the number of sentences per subject. Visual features, comprising the coordinates of points along the contour of each subject's lips, have been extracted for the entire corpus using the Active Appearance Models (AAM) algorithm and are made available along with the audio and video recordings. Because the subjects are Indian, PRAV is also well suited to studies of audio-visual speech by non-native English speakers. Finally, the paper illustrates how the large number of sentences per subject makes PRAV a significant dataset, highlighting its utility for a range of research problems, including visual speech synthesis and perception studies.
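
To make the use of the released lip-contour features concrete, the sketch below shows one way simple per-frame lip-shape descriptors (mouth width, height, and enclosed area) could be derived from AAM contour coordinates such as those distributed with PRAV. The file name, the array layout (frames x points x 2), and the choice of derived features are assumptions for illustration only; the actual corpus documentation should be consulted for the released format.

```python
# Minimal, hypothetical sketch: deriving lip-shape features from AAM lip-contour
# coordinates. The array layout and file name below are assumed, not taken from
# the PRAV release.
import numpy as np


def lip_shape_features(contours: np.ndarray) -> np.ndarray:
    """contours: (n_frames, n_points, 2) array of (x, y) coordinates of the
    lip-contour points in each video frame.
    Returns an (n_frames, 3) array of [width, height, area] per frame."""
    width = contours[:, :, 0].max(axis=1) - contours[:, :, 0].min(axis=1)
    height = contours[:, :, 1].max(axis=1) - contours[:, :, 1].min(axis=1)
    # Shoelace formula for the area of the polygon traced by the contour points.
    x, y = contours[:, :, 0], contours[:, :, 1]
    area = 0.5 * np.abs(np.sum(x * np.roll(y, -1, axis=1)
                               - np.roll(x, -1, axis=1) * y, axis=1))
    return np.stack([width, height, area], axis=1)


if __name__ == "__main__":
    # Hypothetical example: load the contour track for one sentence of one subject.
    contours = np.load("subject1_sentence0001_lip_contour.npy")  # assumed path/format
    features = lip_shape_features(contours)
    print(features.shape)  # (n_frames, 3)
```

Such low-dimensional shape descriptors are one common way to summarize AAM landmarks for downstream tasks like audio-visual speech recognition or visual speech synthesis; the full coordinate trajectories can of course be used directly as well.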
