Improving n-gram models by incorporating enhanced distributions

Two methods of improving conventional n-gram statistical language models are examined. The first uses a new set of n-gram statistics intended to improve the system's ability to identify phrases correctly. The second replaces the maximum likelihood unigram component with an optimised distribution. We test these approaches by incorporating them into weighted average [1] and deleted estimate [2] language models trained on a large newspaper corpus. The improvements reduce perplexity by 4.5% and 4.9%, respectively, for these models.
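
To make the second idea concrete, the following is a minimal Python sketch of an interpolated bigram model whose unigram component can be swapped out for an alternative distribution, with perplexity measured on held-out text. The function names, the fixed interpolation weight, and the toy data are illustrative assumptions only; this is not the weighted average model of [1] or the deleted estimate model of [2], and the paper's optimised unigram distribution would be plugged in where the maximum likelihood unigram is used below.

```python
from collections import Counter
import math

def train_counts(tokens, n):
    """Collect n-gram and (n-1)-gram context counts from a token sequence."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngrams, contexts

def interpolated_prob(word, context, ngrams, contexts, unigram_dist, lam=0.7):
    """P(word | context) as a fixed-weight mix of the bigram ML estimate
    and a unigram component (ML baseline or a re-optimised distribution)."""
    ctx_count = contexts.get(context, 0)
    ml_bigram = ngrams.get(context + (word,), 0) / ctx_count if ctx_count else 0.0
    return lam * ml_bigram + (1 - lam) * unigram_dist.get(word, 1e-10)

def perplexity(tokens, ngrams, contexts, unigram_dist, n=2, lam=0.7):
    """Per-token perplexity of held-out text under the interpolated model."""
    log_prob, count = 0.0, 0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        p = interpolated_prob(tokens[i], context, ngrams, contexts, unigram_dist, lam)
        log_prob += math.log(max(p, 1e-12))
        count += 1
    return math.exp(-log_prob / count)

# Toy example; a real evaluation would use a large newspaper corpus.
train = "the cat sat on the mat the dog sat on the rug".split()
heldout = "the cat sat on the rug".split()
ngrams, contexts = train_counts(train, 2)

# Baseline: maximum likelihood unigram component.
ml_unigrams = Counter(train)
total = sum(ml_unigrams.values())
ml_dist = {w: c / total for w, c in ml_unigrams.items()}
print("Perplexity with ML unigram component:", perplexity(heldout, ngrams, contexts, ml_dist))
```

Replacing `ml_dist` with a differently estimated unigram distribution and comparing the two perplexity values illustrates, in miniature, the kind of comparison reported in the paper.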