Speech recognition with large-scale speaker-class-based acoustic modeling

This paper investigates speaker-independent speech recognition with speaker-class models. In previous studies based on this method, the number of speaker classes was relatively small, and it was difficult to improve performance significantly over the baseline. In this work, as many as 500 speaker-class models are used to model speaker characteristics more precisely. To avoid a shortage of training data for each speaker-class model, a soft clustering technique is used in which a training speaker is allowed to belong to several classes. In the recognition experiments, a conventional method with several tens of speaker-class models yielded only a slight improvement in performance. In contrast, an unsupervised soft clustering method with several hundred speaker-class models yielded a significant improvement. In addition, the results indicate that the error rate could be reduced drastically if speaker-class model selection were performed more effectively.
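
The idea of soft clustering, in which one training speaker contributes data to several speaker classes, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the abstract does not state how speakers are represented or how class membership is decided, so the 2-D speaker vectors, the Euclidean distance, the "k nearest centroids" rule, and the names `soft_assign` and `class_members` are all hypothetical.

```python
import math

def soft_assign(speakers, centroids, k=2):
    """Map each speaker to the indices of its k nearest class centroids.

    Hypothetical rule: k=1 gives hard clustering, k>1 gives soft clustering.
    """
    assignment = {}
    for name, vec in speakers.items():
        # Sort classes by Euclidean distance to this speaker's vector.
        ranked = sorted((math.dist(vec, c), idx) for idx, c in enumerate(centroids))
        assignment[name] = [idx for _, idx in ranked[:k]]
    return assignment

def class_members(assignment, n_classes):
    """Invert the assignment: list the training speakers available to each class."""
    members = [[] for _ in range(n_classes)]
    for name, classes in assignment.items():
        for idx in classes:
            members[idx].append(name)
    return members

# Toy data (assumed): four speakers and three class centroids in a 2-D space.
speakers = {
    "spk_a": (0.0, 0.0),
    "spk_b": (1.0, 0.0),
    "spk_c": (4.0, 0.0),
    "spk_d": (5.0, 0.0),
}
centroids = [(0.0, 0.0), (2.5, 0.0), (5.0, 0.0)]

hard = soft_assign(speakers, centroids, k=1)  # each speaker in exactly one class
soft = soft_assign(speakers, centroids, k=2)  # each speaker in two classes

print(class_members(hard, len(centroids)))
print(class_members(soft, len(centroids)))
```

With hard assignment the middle class in this toy data receives no training speakers, whereas with soft assignment every class has at least two. This is the data-shortage problem that motivates soft clustering when the number of classes grows to several hundred.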
