Measuring Speech Activity

This report discusses the algorithm described in ITU-T Recommendation P.56 for measuring the active speech level. Method B in P.56 determines a speech activity factor representing the fraction of time that the signal is considered to be active speech (as opposed to background idle noise) and the corresponding active level for the speech part of the signal. The basic algorithm generates an envelope value at each sample time. The envelope values are compared with a discrete set of thresholds. The (approximate) active speech level is determined by interpolating in the log domain between the threshold values. In this report we assess the effect of this interpolation on the active speech level. Recommendation P.56 allows for sampling rates as low as 600 Hz. Results for subsampled data are compared with those calculated at the full speech sampling rate.

Speech activity measurement involves determining the fraction of time that a signal contains active speech and the speech level while speech is active. Knowledge of the speech activity is important in speech signal measurements. For speech databases, it is important that undue leading and trailing non-speech be excised and that the speech level be properly scaled based on the peak signal level and the active speech level [1]. For testing speech coders with environmental noise, artificial test signals are created by adding recorded background noise to clean speech segments. The signal-to-noise ratio for such speech-plus-noise signals is determined as the ratio of the active level for the speech to the rms level for the recorded noise [1].

In the speech coding community, considerable research effort is being expended on variable rate coders or discontinuous transmission systems that attempt to economize on average bit rate and/or power consumption by exploiting the fact that speech occurs in talk spurts.
The efficacy of such techniques can be compared to speech activity measurements.

Specifications for the measurement of the level of speech signals are given in ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) Recommendation P.56 [2] as Method B. The measurement of the active level of speech takes into account the fact that speech may contain embedded pauses. Experiments have shown that listeners will perceive a pause in the speech if there is a gap of 350–400 ms or larger [3]. If such gaps are due to pauses between phrases or pauses to emphasize words, they are termed grammatical pauses. Grammatical pauses and other long gaps with idle noise do not affect the perceived loudness and are not counted as active speech. The smaller gaps inherent in any utterance are termed structural pauses and are counted as part of the active speech segment.

The output of the speech activity algorithm is a speech activity factor representing the fraction of the signal that can be considered to be active speech and the corresponding active speech level for the speech part of the signal. An implementation of a Speech Voltmeter using the algorithm in Recommendation P.56 is part of the ITU-T Software Tools Library [4][5], referred to here as ITU-T STL.

The algorithm under discussion provides active level information for an utterance as a whole. Other speech level measurements give an immediate indication of the speech level and are meant for real-time monitoring (see the discussion of Method A in [2]). An example is the volume unit (VU) meter often seen on both professional and consumer audio equipment.

1 Envelope Calculation

The speech activity algorithm calculates an “envelope” for the speech signal. This is a double exponential filtering (two cascaded first-order recursive averages) of the magnitude of the speech sample values,

\[
\begin{aligned}
p_i &= g\,p_{i-1} + (1-g)\,|x_i|, \\
q_i &= g\,q_{i-1} + (1-g)\,|p_i|.
\end{aligned}
\tag{1}
\]

The envelope $q_i$ is calculated starting with zero initial conditions.¹ The parameter $g$ is determined by the time constant of the averaging and is set to