Semantic Spaces for Improving Language Modeling

Abstract

Language models are crucial for many tasks in NLP (Natural Language Processing), and n-grams are the best way to build them. A huge effort is being invested in improving n-gram language models. By introducing external information (morphology, syntax, partitioning into documents, etc.) into the models, a significant improvement can be achieved. The models can, however, be improved with no external information, and smoothing is an excellent example of such an improvement.

In this article we show another way of improving the models that also requires no external information. We examine patterns that can be found in large corpora by building semantic spaces (HAL, COALS, BEAGLE and others described in this article). These semantic spaces have never been tested in language modeling before. Our method uses semantic spaces and clustering to build classes for a class-based language model. The class-based model is then coupled with a standard n-gram model to create a very effective language model.

Our experiments show that our models reduce the perplexity and improve the accuracy of n-gram language models with no external information added. Training of our models is fully unsupervised. Our models are very effective for inflectional languages, which are particularly hard to model. We show results for five different semantic spaces with different settings and different numbers of classes. The perplexity tests are accompanied by machine translation tests that prove the ability of the proposed models to improve the performance of a real-world application.

Keywords: Class-based language models, Semantic spaces, HAL, COALS, BEAGLE, Random Indexing, Purandare&Pedersen, Clustering, Inflectional languages, Machine translation.

1. Introduction

Language modeling is a crucial task in many areas of NLP.
Speech recognition, optical character recognition and many other areas heavily depend on the performance of the language model being used. Each improvement in language modeling may therefore also improve the particular task in which the language model is employed.

Research into language modeling started more than 20 years ago and has evolved into a very mature discipline. It is now very difficult to outperform the state of the art. Our research is focused on inflectional languages, as we believe that these languages offer some room for improvement. However, we also provide experiments for English (which is not a very inflectional language). Even in the case of English, we were able to obtain positive results.

Czech and Slovak belong to the Slavic language group. These languages are highly inflectional and have a relatively free word order. Czech has seven cases and three genders. Slovak has six cases and also three genders.
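The abstract mentions building semantic spaces such as HAL from large corpora. As a rough illustration of the general idea (not the authors' implementation; the function name, window size and weighting are illustrative assumptions), a HAL-style space assigns each word a vector of distance-weighted co-occurrence counts collected with a sliding window:

```python
import numpy as np

def hal_vectors(tokens, vocab, window=4):
    """Sketch of a HAL-style co-occurrence space: row M[i] is the context
    vector of vocab[i], built from distance-weighted counts of the words
    that follow it within `window` positions."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                # closer neighbours receive a higher weight, as in HAL
                M[idx[w], idx[tokens[i + d]]] += window + 1 - d
    return M
```

Rows of such a matrix (possibly after dimensionality reduction) can then serve as input vectors for the clustering step described in the abstract.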
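The abstract's class-based component follows the classic decomposition p(w_i | w_{i-1}) ≈ p(c_i | c_{i-1}) · p(w_i | c_i), where each word belongs to one class. The following is a minimal sketch of that decomposition, assuming a hard word-to-class mapping (here a hand-made toy one) such as would be produced by clustering semantic-space vectors; function names are illustrative:

```python
from collections import Counter

def train_class_based_lm(sentences, word2class):
    """Collect counts for a class-based bigram model:
    p(w | w_prev) ~= p(class(w) | class(w_prev)) * p(w | class(w))."""
    class_bigrams = Counter()   # counts of (c_prev, c) pairs
    class_unigrams = Counter()  # counts of class tokens
    word_counts = Counter()     # counts of word tokens
    for sent in sentences:
        classes = [word2class[w] for w in sent]
        word_counts.update(sent)
        class_unigrams.update(classes)
        class_bigrams.update(zip(classes, classes[1:]))
    return class_bigrams, class_unigrams, word_counts

def class_prob(w_prev, w, word2class, class_bigrams, class_unigrams, word_counts):
    c_prev, c = word2class[w_prev], word2class[w]
    if not class_unigrams[c_prev] or not class_unigrams[c]:
        return 0.0
    p_cc = class_bigrams[(c_prev, c)] / class_unigrams[c_prev]  # p(c | c_prev)
    p_wc = word_counts[w] / class_unigrams[c]                   # p(w | c)
    return p_cc * p_wc
```

For example, with classes {the, a} → DET, {cat, dog} → N, {runs} → V and the training sentences "the cat runs" and "a dog runs", the model assigns a nonzero probability to the unseen bigram "the dog", which is exactly the generalization that lets such a model be interpolated with a standard n-gram model, as described in the abstract.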
