A Comparison of Subjective Methods for Evaluating Speech Quality

With the advances realized in voice coding algorithms over the past two decades, it has become increasingly evident that speech intelligibility alone is not a sufficient criterion of system performance. As a result, a number of methods have been developed to measure the quality or acceptability of speech. Several methods have been used fairly extensively. These include, in particular, the Diagnostic Acceptability Measure (DAM), which reports a Composite Acceptability Estimate (CAE); the Absolute Category Rating (ACR) method, which reports a Mean Opinion Score (MOS); and the Degradation Category Rating (DCR) method, which reports a Degradation Mean Opinion Score (DMOS). Comparison of these methods based solely on data in the literature is difficult, if not impossible. Given the many recent developments in speech coding technology for network and wireless applications, there is a clear need for a rigorous comparative evaluation of the major methods of acceptability evaluation. The purposes of this investigation were (1) to examine the interrelations among the scores yielded by three methods of evaluating speech acceptability and (2) to compare the resolving power of these methods for several types of coincidental and systematic speech degradation commonly encountered in modern digital voice communications.
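Both the ACR and DCR methods reduce listener judgments on a five-point category scale to a single mean score. As an illustrative sketch (not part of the study itself, and using hypothetical ratings), the computation is simply an arithmetic mean over the pooled category judgments:

```python
from statistics import mean

# ACR scale: 5 = Excellent ... 1 = Bad
# DCR scale: 5 = Degradation inaudible ... 1 = Degradation very annoying

def category_mean(ratings):
    """Mean of 1-5 category ratings (MOS for ACR data, DMOS for DCR data)."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 category scale")
    return mean(ratings)

# Hypothetical listener judgments for one coded-speech condition
acr_ratings = [4, 3, 4, 5, 3, 4]   # Absolute Category Rating judgments
dcr_ratings = [4, 4, 3, 5, 4, 4]   # Degradation Category Rating judgments

print(f"MOS  = {category_mean(acr_ratings):.2f}")
print(f"DMOS = {category_mean(dcr_ratings):.2f}")
```

Note that the DAM's Composite Acceptability Estimate is not a simple mean of this kind; it is derived from multiple diagnostic rating scales, so it is not shown here.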