Compression of acoustic features - are perceptual quality and recognition performance incompatible goals?

The client-server model is being advocated for speech recognition over networks: the client computes the acoustic features, compresses them and transmits them to the server. This has prompted a number of papers claiming that, because recognition accuracy and perceptual quality are different goals, a new compression approach is needed. The claim is supported by experiments in which codecs such as CELP degrade recognition performance, whereas direct quantization of the acoustic features at data rates as low as 4 kbit/s gives little or no degradation. In this paper we show that the goals are not incompatible, and that a very low bit-rate codec can be used to perform the compression. We also show that if the ability to reproduce the speech is not needed, a bit rate as low as 625 bit/s can be achieved by computing and compressing posterior phone probabilities.