A Dynamically Expandable, Weakly Supervised, Audio-Visual Database of Stuttered Speech

Stuttering affects at least 1% of the world population. It is characterized by irregular disruptions in speech production, which occur in various forms and frequencies; repetitions of words or parts of words, prolongations, and blocks in getting words out are the most common. Accurate detection and classification of stuttering would be valuable for assessing severity in speech therapy. Furthermore, real-time detection could open up many new possibilities for reconstructing disfluent speech into fluent speech. Such an interface could help people use voice-based assistants like Apple Siri and Google Assistant, or make (video) phone calls more fluent through delayed delivery. In this paper we present the first expandable audio-visual database of stuttered speech. We explore an end-to-end, real-time, multi-modal model for detecting and classifying stuttered blocks in unbounded speech. We also make use of the video signal, since during a block the acoustic signal may not be produced immediately. Using multiple modalities, acoustic signals together with the secondary characteristics exhibited in visual signals, permits more accurate detection.
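To make the multi-modal idea concrete, the following is a minimal sketch in PyTorch of an audio-visual fusion classifier of the kind the abstract describes. The architecture, feature dimensions, and number of disfluency classes are illustrative assumptions, not the authors' actual implementation.

    # Hypothetical sketch: per-frame audio and visual features are encoded,
    # fused, passed through a recurrent layer for temporal context, and
    # classified as fluent speech or one of several (assumed) disfluency types.
    import torch
    import torch.nn as nn

    class AudioVisualStutterNet(nn.Module):
        def __init__(self, audio_dim=40, video_dim=128, hidden=128, num_classes=4):
            super().__init__()
            self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
            self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
            # Bidirectional GRU models temporal context across the utterance.
            self.temporal = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, num_classes)

        def forward(self, audio, video):
            # audio: (batch, time, audio_dim), e.g. log-mel frames
            # video: (batch, time, video_dim), e.g. mouth-region embeddings
            fused = torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=-1)
            out, _ = self.temporal(fused)
            return self.classifier(out)  # per-frame logits: (batch, time, num_classes)

    if __name__ == "__main__":
        model = AudioVisualStutterNet()
        a = torch.randn(2, 100, 40)    # dummy audio features
        v = torch.randn(2, 100, 128)   # dummy visual features
        print(model(a, v).shape)       # torch.Size([2, 100, 4])

Because the visual stream is available even when no sound is being produced, a per-frame formulation like this one can in principle flag a block before any acoustic evidence appears.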
