Concept Discovery and Automatic Semantic Annotation for Language Understanding in an Information-Query Dialogue System Using Latent Dirichlet Allocation and Segmental Methods

Efficient statistical approaches have recently been proposed for natural language understanding in dialogue systems. However, these approaches are trained on data that is semantically annotated at the segmental level, which increases the cost of producing such resources. This kind of annotation requires both identifying the concepts in a sentence and linking each of them to its corresponding word segment. In this paper, we propose a two-step automatic method for semantic annotation. The first step applies latent Dirichlet allocation to discover concepts in a dialogue corpus. This knowledge is then used as a bootstrap to automatically infer a segmentation of a word sequence into concepts, using either integer linear optimisation or stochastic word alignment models (IBM models). The relation between the automatically derived concepts and manually defined, task-dependent concepts is evaluated on a spoken dialogue task with a reference annotation.
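
As an illustration of the second step only, the sketch below trains a toy IBM Model 1 by expectation-maximisation to link words to concept labels. It is a minimal sketch under stated assumptions, not the implementation used in the paper: it assumes the concept discovery step has already assigned a set of concepts to each sentence, and the corpus, the concept names (LODGING, CITY) and the function names `train_ibm1` and `align` are all hypothetical.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=20):
    """Estimate t[(word, concept)] = p(word | concept) with IBM Model 1 EM.

    pairs: list of (words, concepts); concepts is the sentence-level
    concept set, assumed here to come from a prior discovery step.
    """
    vocab = {w for words, _ in pairs for w in words}
    t = defaultdict(lambda: 1.0 / len(vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected (word, concept) counts
        total = defaultdict(float)  # expected counts per concept
        for words, concepts in pairs:
            for w in words:
                # E-step: share each word among the sentence's concepts
                norm = sum(t[(w, c)] for c in concepts)
                for c in concepts:
                    frac = t[(w, c)] / norm
                    count[(w, c)] += frac
                    total[c] += frac
        # M-step: renormalise the expected counts per concept
        t = defaultdict(float, {(w, c): n / total[c] for (w, c), n in count.items()})
    return t


def align(words, concepts, t):
    """Label each word with its most probable concept under t."""
    return [max(concepts, key=lambda c: t[(w, c)]) for w in words]


# Hypothetical toy corpus: word sequences with sentence-level concepts.
pairs = [
    (["a", "hotel", "in", "paris"], ["LODGING", "CITY"]),
    (["a", "hotel"], ["LODGING"]),
    (["in", "paris"], ["CITY"]),
    (["in", "lyon"], ["CITY"]),
]
t = train_ibm1(pairs)
print(align(["hotel", "paris"], ["LODGING", "CITY"], t))
```

On this toy corpus the aligner attaches "hotel" to LODGING and "paris" to CITY, because each word also occurs in a sentence carrying only one concept. Merging adjacent words that share a label would then give a segmentation into concepts.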
