Dysarthric speech recognition using a convolutive bottleneck network

In this paper, we investigate the recognition of speech produced by a person with an articulation disorder resulting from athetoid cerebral palsy. The articulation of the first spoken words tends to become unstable due to strain on the speech muscles, which degrades the performance of conventional speech recognition systems. We therefore propose a robust feature extraction method based on a convolutive bottleneck network (CBN) in place of the widely used MFCC features. The CBN stacks layers of several types, such as a convolution layer, a subsampling layer, and a bottleneck layer, to form a deep network. By applying the CBN to feature extraction for dysarthric speech, we expect it to reduce the influence of the unstable speaking style caused by the athetoid symptoms. We confirmed its effectiveness through word-recognition experiments, in which the CBN-based features outperformed conventional feature extraction.
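To make the layer stack concrete, here is a minimal sketch of one CBN-style forward pass: a convolution layer over a time-frequency patch, a subsampling (max-pooling) layer, and a narrow bottleneck layer whose activations serve as the feature vector. All weights are random and untrained, the ReLU/tanh choices and the layer sizes (4 kernels, 2x2 pooling, 8-dimensional bottleneck, 24-band x 11-frame patch) are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(x, kernels):
    """Naive 'valid' 2-D convolution of a single-channel map with K kernels."""
    K, kh, kw = kernels.shape
    H, W = x.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(x[i:i + kh, j:j + kw] * kernels[k])
    return out

def max_pool(maps, p=2):
    """Non-overlapping p x p max subsampling of each feature map."""
    K, H, W = maps.shape
    H2, W2 = H // p, W // p
    maps = maps[:, :H2 * p, :W2 * p]
    return maps.reshape(K, H2, p, W2, p).max(axis=(2, 4))

def cbn_features(patch, n_kernels=4, kernel_size=3, bottleneck_dim=8):
    """Convolution -> subsampling -> bottleneck (random, untrained weights)."""
    kernels = rng.standard_normal((n_kernels, kernel_size, kernel_size)) * 0.1
    h = np.maximum(conv2d_valid(patch, kernels), 0.0)  # convolution layer + nonlinearity
    h = max_pool(h)                                    # subsampling layer
    flat = h.reshape(-1)
    W_bn = rng.standard_normal((bottleneck_dim, flat.size)) * 0.01
    return np.tanh(W_bn @ flat)                        # bottleneck activations = features

# Example: a synthetic 24-band x 11-frame mel-spectrogram patch
patch = rng.standard_normal((24, 11))
feat = cbn_features(patch)
print(feat.shape)  # (8,)
```

In a trained CBN the weights would be learned by backpropagation and the low-dimensional bottleneck activations replace MFCCs as input to the HMM-based recognizer.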
