China has many dialects and sub-dialects. Because they differ grammatically, lexically, phonologically, and phonetically to varying degrees, people from different dialect regions often have difficulty in oral communication. Since 1956, standard Mandarin has been popularized throughout the country as the official language, and almost every dialect speaker has begun to learn Mandarin as a second language. Affected by their native dialects, however, many of them speak Mandarin with regional accents. In modern speech processing technologies, speech is represented by its spectrum, which contains not only dialectal linguistic information but also extra-linguistic information such as the gender and age of the speaker. To focus exclusively on the linguistic features of dialectal utterances, a speaker-invariant structural representation of speech, originally proposed by the second author and inspired by infants' language acquisition (1, 2), is used to represent the pronunciation of Chinese dialect speakers. Since the purely dialectal information can be extracted by removing the extra-linguistic information from dialect speech, this pronunciation structure can be applied to estimate which dialect or sub-dialect region a speaker belongs to and to assess the speaker's pronunciation. To verify the validity of our approach, speaker classification based on dialectal utterances of 38 Chinese finals is investigated, especially in terms of robustness to speaker variability. The result is linguistically reasonable and highly independent of age and gender. After that, a sub-dialect corpus is developed using as reading material a list of characters that was originally used by linguists to investigate dialect speakers' pronunciation. Then, after a sub-dialect pronunciation structure is built for every speaker, the speakers' pronunciations are classified based on the distances among their structures.
The result shows that the sub-dialect speakers can also be classified linguistically, with little influence from their age and gender. In conclusion, this structural representation of Chinese dialects can extract the purely dialectal and sub-dialectal information from speech and works well for dialect-based and sub-dialect-based speaker classification.
Index Terms: Chinese dialects, extra-linguistic feature, pronunciation structure, Bhattacharyya distance, speaker classification
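As an illustration of the idea summarized above, the following is a minimal sketch, not the authors' implementation. It assumes that each sound unit (for example, a final) of a speaker is modeled as a single Gaussian, that a speaker's pronunciation structure is the matrix of Bhattacharyya distances among those Gaussians, and that two structures are compared by the root-mean-square difference of their entries. The function names and the structure-to-structure distance are hypothetical choices made for this sketch. Because the Bhattacharyya distance is unchanged by an invertible affine transform applied to both distributions, a structure built this way does not change when all of a speaker's features undergo such a transform.

```python
import numpy as np


def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mean term: (1/8) * diff^T cov^{-1} diff
    mean_term = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Covariance term: (1/2) * ln(det(cov) / sqrt(det(cov1) * det(cov2)))
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    cov_term = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return float(mean_term + cov_term)


def pronunciation_structure(means, covs):
    """Symmetric matrix of pairwise distances among one speaker's sound units."""
    n = len(means)
    structure = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = bhattacharyya_distance(means[i], covs[i], means[j], covs[j])
            structure[i, j] = structure[j, i] = d
    return structure


def structure_distance(s1, s2):
    """RMS difference over the upper triangles (an assumption of this sketch)."""
    iu = np.triu_indices(s1.shape[0], k=1)
    return float(np.sqrt(np.mean((s1[iu] - s2[iu]) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical speaker: 4 sound units in a 3-dimensional feature space.
    means = [rng.normal(size=3) for _ in range(4)]
    covs = [np.eye(3) * (0.5 + i) for i in range(4)]

    # The same pronunciation seen through an invertible affine transform,
    # used here as a stand-in for a change of speaker characteristics.
    A = np.array([[2.0, 0.3, 0.0], [0.0, 1.5, 0.2], [0.1, 0.0, 0.8]])
    b = np.array([1.0, -2.0, 0.5])
    means_t = [A @ m + b for m in means]
    covs_t = [A @ c @ A.T for c in covs]

    s = pronunciation_structure(means, covs)
    s_t = pronunciation_structure(means_t, covs_t)
    print("distance between the two structures:", structure_distance(s, s_t))
```

Under these assumptions, speakers could then be grouped by applying any standard clustering method to the matrix of pairwise structure distances.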
[1] Hermann Ney, et al. Vocal tract normalization equals linear transformation in cepstral space. IEEE Transactions on Speech and Audio Processing, 2001.
[2] Nobuaki Minematsu, et al. F-divergence is a generalized invariant measure between distributions. INTERSPEECH, 2008.
[3] K. Nishinari, et al. Theorem of the invariant structure and its derivation of speech Gestalt. 2005.
[4] Keikichi Hirose, et al. Structural representation of the pronunciation and its use for CALL. 2006 IEEE Spoken Language Technology Workshop, 2006.
[5] Keikichi Hirose, et al. Multi-stream parameterization for structural speech recognition. 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.
[6] Nobuaki Minematsu. Mathematical evidence of the acoustic universal structure in speech. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 2005.
[7] P. Jusczyk. The discovery of spoken language. 1997.
[8] Richard VanNess Simmons, et al. 汉语方言词汇调查手册 = Handbook for lexicon based dialect fieldwork. 2006.