Evaluating dialog systems used in the real world

An important aspect of building high-performance natural language dialog systems is how they are evaluated. While a universally accepted evaluation method exists for pure speech recognition, it is less clear how speech understanding or dialog systems should be evaluated. We describe the methods we typically use for our systems and argue that it is not sufficient to evaluate their components separately. Instead, a measure for the system in its entirety is needed.
