Conceptual decoding and machine learning: application to the MEDIA human-machine dialogue corpus

Within the framework of the French evaluation program MEDIA on spoken dialogue systems, this paper presents the methods proposed at the LIA for the robust extraction of basic conceptual constituents (or concepts) from an audio message. The proposed conceptual decoding model follows a stochastic paradigm and is directly integrated into the Automatic Speech Recognition (ASR) process. This approach allows us to keep the probabilistic search space over words produced by the ASR module and to project it onto a probabilistic search space of concepts. The experiments carried out on the MEDIA corpus show that the performance reached by our approach is state of the art on manual transcriptions of dialogues. By partitioning the training corpus into subsets of different sizes, one can measure the impact of training corpus size on decoding performance and therefore estimate the minimal as well as the optimal number of dialogue examples needed. Finally, we detail how a priori knowledge can be integrated into our models in order to increase their coverage and therefore lower, for the same level of performance, the amount of training data needed.
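As an illustration of the corpus-size experiment mentioned above, the following minimal sketch shows one way to draw nested training subsets of increasing size, so that each smaller subset is contained in every larger one. It is an assumption about how such a partition could be set up, not the procedure used in the paper: the function name `nested_training_subsets`, the dialogue identifiers, and the subset sizes are all hypothetical, and training and scoring of the decoder are left out.

```python
import random


def nested_training_subsets(dialogues, sizes, seed=0):
    """Return a dict mapping each size to a subset of `dialogues`.

    The dialogues are shuffled once with a fixed seed, and every subset is a
    prefix of that shuffled list, so smaller subsets are nested in larger ones.
    (Hypothetical helper, for illustration only.)
    """
    shuffled = list(dialogues)
    random.Random(seed).shuffle(shuffled)
    subsets = {}
    for size in sorted(sizes):
        if size > len(shuffled):
            raise ValueError(f"requested {size} dialogues, only {len(shuffled)} available")
        subsets[size] = shuffled[:size]
    return subsets


# Hypothetical dialogue identifiers standing in for training dialogues.
dialogues = [f"dialogue_{i:03d}" for i in range(20)]
subsets = nested_training_subsets(dialogues, sizes=[5, 10, 20])

for size, subset in subsets.items():
    # A model would be trained on `subset` and evaluated on a fixed test set here.
    print(size, len(subset))
```

Keeping the subsets nested means that any difference in decoding performance between two sizes comes from the added dialogues only, rather than from a different random draw.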