Evaluation of a spoken dialogue system with usability tests and long-term pilot studies: similarities and differences

We present findings from a long-term study of a speech-based bus timetable system. After deploying the prototype system, we collected data from real usage for 30 months. In addition, we conducted usability tests to obtain subjective ratings of the pilot system. A comparison of these evaluations shows that the results obtained with usability tests differ significantly from those gained from real usage, and that the data from initial use differs significantly from the data collected later. For example, the differences in help requests, interruptions, speech recognition rejections, silence timeouts, and repeat requests are highly significant, and in some cases, such as explicit quit requests, enormous (65% versus 3%).