A bin-based ontological framework for low-resource n-gram smoothing in language modelling

In this paper, we introduce a novel method for smoothing language models (LMs) based on the semantic information found in ontologies, particularly suited to low-resource language modeling. We exploit the latent knowledge of language that is deeply encoded within ontologies. Specifically, this work examines the potential of using the semantic and syntactic relations between words in the WordNet ontology to generate new plausible contexts for unseen events, thereby simulating a larger corpus. These unseen events are then combined with a baseline Witten-Bell (WB) LM in order to improve its performance, both in terms of language model perplexity and automatic speech recognition word error rate. Results show a significant reduction in language model perplexity (up to 9.85% relative), together with a statistically significant reduction in word error rate compared with both the original WB LM and a baseline Kneser-Ney smoothed LM on the Wall Street Journal-based Continuous Speech Recognition Phase II corpus.
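The abstract does not give implementation details, but the core idea it describes, using WordNet relations to propose plausible counts for unseen n-grams before smoothing, can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes NLTK's WordNet interface, bigram counts held in a plain dictionary, and a hypothetical scaling factor `alpha`; the paper's bin-based weighting and the interpolation with the Witten-Bell model are not reproduced here.

```python
# Minimal sketch (an assumption, not the paper's exact method): for a bigram
# (h, w) never seen in training, look up WordNet neighbours of w that *were*
# seen after h, and award the unseen bigram a small pseudo-count borrowed
# from those neighbours, so the "simulated corpus" covers more events.

from collections import Counter
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def semantic_neighbours(word):
    """Collect synonym, hypernym, and hyponym lemmas of `word` from WordNet."""
    neighbours = set()
    for syn in wn.synsets(word):
        neighbours.update(syn.lemma_names())
        for related in syn.hypernyms() + syn.hyponyms():
            neighbours.update(related.lemma_names())
    neighbours.discard(word)
    return neighbours

def pseudo_counts(bigram_counts, vocab, alpha=0.1):
    """Return simulated bigram counts: observed counts plus a pseudo-count
    alpha * sum(count(h, w')) for each unseen (h, w), where w' ranges over
    WordNet neighbours of w.  `alpha` is an assumed scaling factor."""
    simulated = Counter(bigram_counts)
    histories = {h for (h, _) in bigram_counts}
    for h in histories:
        for w in vocab:
            if (h, w) in bigram_counts:
                continue
            borrowed = sum(bigram_counts.get((h, n), 0)
                           for n in semantic_neighbours(w))
            if borrowed:
                simulated[(h, w)] = alpha * borrowed
    return simulated
```

Under this reading, the simulated counts would then be passed to a standard smoothing toolkit and mixed with the baseline WB estimates; how the paper weights and interpolates the generated events is not specified in the abstract.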
