Analysis and Automatic Recognition of Tones in Mandarin Chinese
In tonal languages such as Mandarin Chinese, words are defined by their phonemic sequence and by the pitch patterns (tones) of their syllables.
To see if the problem of tone recognition is worth solving, we propose an information theoretic measure to compare the relative importance (Functional Load) of phonological contrasts in any language. Empirical calculations show that tones are at least as important as vowels for conveying information in Mandarin.
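As an illustration only, a functional-load style measure can be sketched as the relative loss of entropy when a contrast is neutralized; the toy corpus and merge rules below are assumptions for demonstration, not the data or exact definitions used in the thesis.

```python
# Hedged sketch of an entropy-based functional-load measure.
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def functional_load(corpus, merge):
    """Relative entropy lost when a contrast is neutralized by `merge`."""
    h_full = entropy(Counter(corpus))
    h_merged = entropy(Counter(merge(s) for s in corpus))
    return (h_full - h_merged) / h_full

# Toy corpus of (segmental string, tone number) syllables -- purely illustrative.
corpus = [("ma", 1), ("mi", 1), ("ma", 3), ("mi", 3), ("ma", 1), ("mi", 4)]

# Neutralize the tone contrast vs. neutralize the a/i vowel contrast.
fl_tones  = functional_load(corpus, lambda s: (s[0], 0))
fl_vowels = functional_load(corpus, lambda s: (s[0].replace("i", "a"), s[1]))
print(f"FL(tones)  = {fl_tones:.3f}")
print(f"FL(vowels) = {fl_vowels:.3f}")
```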
We then carry out a large and thorough investigation of possible acoustic features for recognizing tones. This involves hundreds of experiments, each classifying over a hundred thousand syllables from ten hours of broadcast news speech.
We first determine a base set of features (based on pitch, duration, and overall intensity) that achieves a syllable classification rate of 58.9%.
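As a hedged sketch of what such a per-syllable base feature vector could look like (the exact features used in the thesis are not reproduced here), one might summarize each syllable by its F0 level and slope, log duration, and mean energy:

```python
# Illustrative per-syllable base features: pitch, duration, intensity.
import numpy as np

def base_features(f0, frame_energy, frame_rate=100.0):
    """Summarize one syllable given per-frame F0 (Hz, NaN when unvoiced)
    and per-frame energy, sampled at `frame_rate` frames per second."""
    voiced = ~np.isnan(f0)
    logf0 = np.log(f0[voiced]) if voiced.any() else np.array([np.log(100.0)])
    t = np.arange(logf0.size) / frame_rate
    slope = np.polyfit(t, logf0, 1)[0] if logf0.size > 1 else 0.0
    return np.array([
        logf0.mean(),                       # pitch level
        slope,                              # pitch movement (rise/fall)
        np.log(f0.size / frame_rate),       # log syllable duration (s)
        np.log(frame_energy.mean() + 1e-8)  # overall intensity
    ])

# Example: a rising-tone-like syllable, 150 ms long.
f0 = np.linspace(180.0, 240.0, 15)
energy = np.full(15, 0.02)
print(base_features(f0, energy))
```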
Experiments on a subset of our data show that simple features based on energy in various frequency bands work better for tone recognition than those based on more complicated methods like harmonic-amplitude differences and glottal flow estimation. Further experiments determine a set of band energy features that improve classification accuracy to 63.7%, with the F score for Neutral Tone increasing from 0.345 to 0.619. This opens up a host of new features for future speech researchers in industry and academia to investigate and use.
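A minimal sketch of such band-energy features follows; the band edges and frame settings are illustrative assumptions rather than the thesis's actual choices.

```python
# Sketch of simple band-energy features from a magnitude spectrogram.
import numpy as np

def band_energy_features(signal, sr=16000, bands=((0, 400), (400, 800),
                                                  (800, 2000), (2000, 4000))):
    """Return one log mean-energy value per frequency band for a syllable."""
    n_fft, hop = 512, 160
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    feats = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(np.log(spec[:, mask].mean() + 1e-10))
    return np.array(feats)

# Example: 200 ms of a synthetic voiced-like signal.
t = np.arange(0, 0.2, 1 / 16000)
syllable = 0.5 * np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 600 * t)
print(band_energy_features(syllable))
```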
We investigate making additional use of context: if we know the tones of the surrounding syllables, we can only increase classification accuracy to 67.2%. (This provides a useful upper bound for our experiments.) While we do not have such ideal contextual information, we can use estimates of it to increase accuracy to 65.0%.
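One plausible way to use estimated context, shown only as a sketch under assumed toy data (the thesis's actual scheme may differ), is a two-pass classifier that appends first-pass tone posteriors of the neighboring syllables to each syllable's features:

```python
# Two-pass context sketch: neighbors' estimated tone posteriors as extra features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def add_context(posteriors):
    """Concatenate each syllable's neighbors' posteriors (zeros at utterance edges)."""
    prev_p = np.vstack([np.zeros_like(posteriors[:1]), posteriors[:-1]])
    next_p = np.vstack([posteriors[1:], np.zeros_like(posteriors[:1])])
    return np.hstack([prev_p, next_p])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))         # acoustic features per syllable (toy data)
y = rng.integers(0, 5, size=500)      # tone labels 0-4 (toy data)

first_pass = LogisticRegression(max_iter=1000).fit(X, y)
context = add_context(first_pass.predict_proba(X))
second_pass = LogisticRegression(max_iter=1000).fit(np.hstack([X, context]), y)
```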
Finally, we investigate the hypothesis that better-articulated syllables are easier to recognize. On a small corpus of lab speech from Xu (1999), we classify syllables in focused words with over 99% accuracy and use this to improve the classification accuracy of all syllables. In broadcast news speech, however, we find that while more strongly articulated syllables are recognized better, the difference is not large enough to motivate an algorithm that exploits it.