Coping with language data sparsity: Semantic head mapping of compound words

In this paper we present a novel clustering technique for compound words. By mapping compounds onto their semantic heads, the technique is able to estimate n-gram probabilities for unseen compounds. We argue that compounds are well represented by their heads, which allows rare words to be clustered while reducing the risk of over-generalization. The semantic heads are obtained by a two-step process consisting of constituent generation followed by best head selection based on corpus statistics. Experiments on Dutch read speech show that our technique correctly identifies compounds and their semantic heads with a precision of 80.25% and a recall of 85.97%. A class-based language model with compound-head clusters achieves a significant reduction in both perplexity and word error rate (WER).
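The two-step process can be illustrated with a minimal sketch. It is not the procedure evaluated in the paper: the minimum constituent length, the set of linking elements, and the frequency-product score are all assumptions made for illustration, as are the names `generate_constituents` and `select_head`. The sketch only assumes that the head is the rightmost constituent, as is usual for Dutch compounds.

```python
from collections import Counter

MIN_PART_LEN = 3            # assumed minimum constituent length
LINKERS = ("", "s", "en")   # assumed linking elements between constituents


def generate_constituents(word, lexicon):
    """Step 1: list (modifier, head) splits whose parts are both known words."""
    splits = []
    for i in range(MIN_PART_LEN, len(word) - MIN_PART_LEN + 1):
        left, head = word[:i], word[i:]
        if head not in lexicon:
            continue
        for linker in LINKERS:
            if not left.endswith(linker):
                continue
            # Strip the linking element (if any) from the left part.
            modifier = left[:len(left) - len(linker)] if linker else left
            if len(modifier) >= MIN_PART_LEN and modifier in lexicon:
                splits.append((modifier, head))
                break
    return splits


def select_head(word, counts):
    """Step 2: pick the head of the best-scoring split.

    The score (product of constituent counts) is a stand-in for the
    corpus statistics used in the paper. Words without a valid split
    are treated as non-compounds and mapped onto themselves.
    """
    splits = generate_constituents(word, counts)
    if not splits:
        return word
    _, head = max(splits, key=lambda s: counts[s[0]] * counts[s[1]])
    return head


# Toy unigram counts; a real system would estimate these from a corpus.
counts = Counter({"voetbal": 50, "wedstrijd": 40, "strijd": 20,
                  "stad": 70, "bus": 90})

print(select_head("voetbalwedstrijd", counts))
print(select_head("stadsbus", counts))
```

In a class-based language model, an unseen compound such as "voetbalwedstrijd" would then share the n-gram statistics of its head "wedstrijd" rather than receiving only a backed-off estimate.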
