Measuring the Perceived Importance of Speech Segments for Transmission over IP Networks

SUMMARY This paper presents a method that uses a linear regression model to produce a single-valued criterion indicating the perceived importance of each block in a stream of speech blocks. Unlike the conventional approach, voice activity detection (VAD), which only separates active speech from non-speech, the method assigns each speech segment a dynamically changing priority value with finer granularity. It can be combined with scalable speech coding techniques and IP QoS services to achieve flexible quality control for speech transmission. A simple linear regression model estimates the mean opinion score (MOS) that results when a given speech segment is missing. Because the estimated MOS is a continuous value, it can be mapped to priority levels of arbitrary granularity. Subjective evaluation confirms the validity of the calculated priority values.
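The idea of estimating MOS by linear regression and quantizing it into priority levels can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two per-block features, the synthetic training values, the coefficients, and the function names `fit_linear_model`, `estimate_mos`, and `mos_to_priority` are all hypothetical. The sketch also assumes that a lower estimated MOS for a missing block means the block matters more and should receive a higher priority.

```python
import numpy as np

def fit_linear_model(features, mos):
    """Least-squares fit of MOS = b0 + b1*x1 + ... (hypothetical features)."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, mos, rcond=None)
    return coef

def estimate_mos(coef, features):
    """Estimate the MOS obtained when each block is missing, clipped to the 1-5 scale."""
    X = np.column_stack([np.ones(len(features)), features])
    return np.clip(X @ coef, 1.0, 5.0)

def mos_to_priority(mos, levels):
    """Map continuous MOS to integer priorities 0..levels-1.

    Assumption: the lower the MOS when a block is lost, the more important
    the block, so it receives a higher priority index.
    """
    importance = (5.0 - np.asarray(mos, dtype=float)) / 4.0  # scaled to [0, 1]
    return np.minimum((importance * levels).astype(int), levels - 1)

# Synthetic, purely illustrative training data (not measured scores):
# two made-up per-block features in [0, 1] and an exactly linear MOS.
rng = np.random.default_rng(0)
train_features = rng.random((50, 2))
train_mos = 4.5 - 2.0 * train_features[:, 0] - 1.0 * train_features[:, 1]

coef = fit_linear_model(train_features, train_mos)
estimated = estimate_mos(coef, train_features)
priorities = mos_to_priority(estimated, levels=4)
```

Changing `levels` changes only the final quantization step, which is what allows the same continuous estimate to serve networks with different numbers of priority classes.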
