Robust Phonetic Segmentation Using Spectral Transition Measure for Non-Standard Recording Environments

Phone-level localization of mis-articulation is a key requirement for an automatic articulation error assessment system. Real-time assessment of phone-level mis-articulations requires a robust phone segmentation technique when the audio is recorded on mobile phones or tablets, a non-standard recording setup with little control over recording quality. We propose a novel post-processing technique to aid Spectral Transition Measure (STM)-based phone segmentation under noisy conditions, such as environmental noise and clipping, that are common in mobile phone recordings. We compare the performance of our approach with phone segmentation using traditional MFCC and PLPCC speech features under Gaussian noise and clipping. The proposed approach was validated on TIMIT and a Hindi speech corpus, and was then used to compute phone boundaries for a set of speech recorded simultaneously on three devices: a laptop, a stationary tablet, and a handheld mobile phone, chosen to simulate different audio qualities in a real-time non-standard recording environment. F-ratio was the metric used to measure the accuracy of phone boundary marking. Experimental results show an improvement of 7% for TIMIT and 10% for Hindi data over the baseline approach. Similar results were observed for the set of three recordings collected in-house.
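For readers unfamiliar with STM-based segmentation, the following is a minimal sketch of one commonly used formulation: the STM at each frame is the mean squared regression slope of the feature trajectories over a short window, and candidate phone boundaries are taken at its local maxima. The function names, the `half_window` parameter, the edge padding, and the simple peak picker are assumptions made for illustration; the sketch does not reproduce the post-processing technique proposed here.

```python
import numpy as np


def spectral_transition_measure(features, half_window=2):
    """Mean squared regression slope of each feature trajectory per frame.

    features: array of shape (num_frames, num_coeffs), e.g. cepstral frames.
    half_window: hypothetical parameter, number of frames on each side
    used to fit the slope.
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    offsets = np.arange(-half_window, half_window + 1)
    denom = np.sum(offsets ** 2)

    # Repeat the edge frames so every frame has a full window (an assumption).
    padded = np.pad(features, ((half_window, half_window), (0, 0)), mode="edge")

    # Least-squares slope of each coefficient over the window around each frame.
    slopes = np.zeros_like(features)
    for n in offsets:
        start = half_window + n
        slopes += n * padded[start:start + num_frames]
    slopes /= denom

    # Average the squared slopes over coefficients.
    return np.mean(slopes ** 2, axis=1)


def pick_boundaries(stm, min_height=0.0):
    """Return frame indices of local maxima of the STM above min_height."""
    stm = np.asarray(stm, dtype=float)
    idx = np.arange(1, len(stm) - 1)
    is_peak = (
        (stm[idx] > stm[idx - 1])
        & (stm[idx] >= stm[idx + 1])
        & (stm[idx] > min_height)
    )
    return idx[is_peak]


if __name__ == "__main__":
    # Toy input: an abrupt spectral change between frames 9 and 10.
    frames = np.vstack([np.zeros((10, 3)), np.ones((10, 3))])
    stm = spectral_transition_measure(frames, half_window=2)
    print(pick_boundaries(stm))
```

In practice, noise and clipping tend to produce spurious STM peaks, which is the kind of situation where a post-processing stage becomes relevant; the `min_height` threshold above is only a placeholder for such a stage.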
