Word Clustering Based on Similarity and Vari-gram Language Model

Cluster-based statistical language models are an important method for addressing the data sparseness problem. Conventional statistical clustering methods are usually based on a greedy principle, and the common metric for evaluating a clustering algorithm is the likelihood or perplexity of the corpus. Such algorithms often converge to a local optimum, so a global optimum is not guaranteed and the initial choices can influence the final result. This paper addresses these problems by presenting a definition of word similarity based on mutual information and, building on that similarity, proposing a bottom-up hierarchical clustering algorithm. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance. A new method for constructing the vari-gram language model is also presented.
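
As a rough illustration of bottom-up hierarchical word clustering driven by a mutual-information-based similarity, the following minimal sketch may help. The abstract does not give the paper's actual similarity definition, so everything here is an assumption: the similarity is taken to be the cosine between vectors of positive pointwise mutual information (PMI) with each word's right-hand neighbor, clusters are merged by average linkage, and the names `pmi_vectors`, `cosine`, and `cluster_words` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_vectors(tokens):
    """Map each word to a sparse vector of positive PMI with its right neighbor.

    Assumption: this stands in for the paper's similarity features,
    which the abstract does not specify.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    vectors = {w: {} for w in unigrams}
    for (w, v), count in bigrams.items():
        p_joint = count / n_bi
        p_indep = (unigrams[w] / n_uni) * (unigrams[v] / n_uni)
        pmi = math.log(p_joint / p_indep)
        if pmi > 0:  # keep only positive associations
            vectors[w][v] = pmi
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(k, 0.0) for k, x in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_words(tokens, num_clusters):
    """Merge the two most similar clusters until num_clusters remain."""
    vectors = pmi_vectors(tokens)
    clusters = [[w] for w in sorted(vectors)]  # start with one word per cluster

    def similarity(c1, c2):
        # Average linkage: mean pairwise word similarity across the two clusters.
        total = sum(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: similarity(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

Calling `cluster_words("the cat sat the dog sat the cat ran the dog ran".split(), 3)` groups words that share right-hand contexts. Because every merge is chosen from a similarity computed over all current cluster pairs, the result does not depend on an initial assignment, which is the kind of initialization sensitivity the abstract attributes to greedy methods. This naive version recomputes all pairwise similarities at each step and is only meant for small vocabularies.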