In this paper, we propose an approach for automatically segmenting and transcribing spontaneous speech without the use of a manually annotated speech database. The spontaneous speech signal is first segmented into syllable-like units by treating its short-term energy as the magnitude spectrum of an arbitrary signal. Similar syllable segments are then grouped together using an unsupervised incremental clustering technique. A separate model is generated for each cluster of syllable segments. At this stage, a label is manually assigned to each cluster. The syllable models of these clusters are then used to transcribe or recognize the spontaneous speech of closed-set speaker data as well as open-set speaker data. As a syllable recognizer, our initial results on Standard Malay television (TV3) news bulletins show a performance of 42.53% for native speakers and 30.8% for non-native speakers.
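The first two stages of the pipeline can be illustrated with a minimal sketch, under loudly stated assumptions. The abstract does not specify how the short-term energy contour is processed once it is treated as a magnitude spectrum, nor which incremental clustering algorithm is used. The sketch below therefore substitutes two simple stand-ins: boundary candidates are taken at local minima of a smoothed energy contour, and segments are grouped with a leader-style incremental clustering rule. All function names, the frame parameters, and the distance threshold are hypothetical and chosen only for illustration.

```python
import math

def short_term_energy(samples, frame_len, hop):
    """Energy of each frame, computed as the sum of squared samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies

def smooth(values, width):
    """Moving average over a centered window, truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

def boundary_frames(energies):
    """Frame indices that are strict local minima of the energy contour.

    Assumption: energy minima stand in for syllable boundaries here.
    This is a simplification, not the processing described in the paper.
    """
    return [i for i in range(1, len(energies) - 1)
            if energies[i] < energies[i - 1] and energies[i] < energies[i + 1]]

def leader_cluster(vectors, threshold):
    """Leader-style incremental clustering (an assumed stand-in).

    Each vector joins the nearest existing centroid if it lies within
    `threshold`; otherwise it starts a new cluster. Centroids are
    updated as running means. Returns one cluster label per vector.
    """
    centroids, counts, labels = [], [], []
    for v in vectors:
        best, best_d = None, None
        for k, c in enumerate(centroids):
            d = math.dist(v, c)
            if best_d is None or d < best_d:
                best, best_d = k, d
        if best is not None and best_d <= threshold:
            n = counts[best]
            centroids[best] = [(ci * n + vi) / (n + 1)
                               for ci, vi in zip(centroids[best], v)]
            counts[best] += 1
            labels.append(best)
        else:
            centroids.append(list(v))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels
```

In such a sketch, each segment between consecutive boundary frames would be summarized by a fixed-length feature vector before clustering. The resulting clusters would then be labeled manually, as described above.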