Semantic class induction and its application for a Chinese voice search system

In this paper, we propose a novel similarity measure based on co-occurrence probabilities for inducing semantic classes. Clustering with the new similarity measure outperforms clustering with the widely used distance measure based on Kullback-Leibler divergence in precision, recall, and F1. We then use the semantic classes and structures induced with the new similarity measure to generate in-domain data. Finally, we use the generated data for language model adaptation, improving character recognition accuracy from 85.2% to 91%.
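For orientation, the baseline mentioned above, a distance based on Kullback-Leibler divergence, can be sketched as follows. This is only an illustration of the general idea, not the measure proposed in the paper, whose definition is not given in this abstract. The choice of right-neighbor contexts, the add-one smoothing, the symmetrized form, the toy corpus, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def context_distribution(sentences, word, vocab, smoothing=1.0):
    """Smoothed distribution over the token immediately to the right of `word`.

    Assumption: only right-neighbor contexts are used, with additive smoothing
    so that every probability is non-zero and the KL divergence stays finite.
    """
    counts = Counter()
    for tokens in sentences:
        for left, right in zip(tokens, tokens[1:]):
            if left == word:
                counts[right] += 1
    total = sum(counts.values()) + smoothing * len(vocab)
    return {v: (counts[v] + smoothing) / total for v in vocab}


def kl_divergence(p, q):
    """D(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)


def symmetric_kl_distance(p, q):
    """Symmetrized KL distance: D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical toy corpus (English stand-in for voice search queries).
sentences = [
    ["fly", "to", "beijing", "tomorrow"],
    ["fly", "to", "shanghai", "tomorrow"],
    ["fly", "to", "beijing", "tonight"],
    ["fly", "to", "shanghai", "tonight"],
    ["fly", "tomorrow", "to", "beijing"],
]
vocab = sorted({token for tokens in sentences for token in tokens})

p_beijing = context_distribution(sentences, "beijing", vocab)
p_shanghai = context_distribution(sentences, "shanghai", vocab)
p_to = context_distribution(sentences, "to", vocab)

# Words that share contexts should be closer than words that do not.
d_similar = symmetric_kl_distance(p_beijing, p_shanghai)
d_different = symmetric_kl_distance(p_beijing, p_to)
print(d_similar, d_different)
```

In a clustering setting, word pairs with the smallest distance would be merged into the same semantic class first; here the two city names, which occur in the same contexts, are closer to each other than either is to "to".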
