Sequence Prediction Exploiting Similarity Information

When data is scarce or the alphabet is large, smoothing the probability estimates becomes unavoidable when estimating n-gram models. In this paper we propose a method that implements a form of smoothing by exploiting similarity information between the alphabet elements. The idea is to view the log-conditional probability function as a smooth function defined over the similarity graph. The proposed algorithm expands the log-conditional probability function in the eigenbasis of the similarity graph and finds the expansion coefficients by solving a regularized logistic regression problem. The experimental results demonstrate the superiority of the method when the similarity graph contains relevant information, while the method remains competitive with state-of-the-art smoothing methods even in the absence of such information.
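To make the construction concrete, the following is a minimal sketch of the idea for a single fixed context, not the authors' implementation: it builds the normalized Laplacian of an assumed symmetric similarity matrix W over the alphabet, expands the log-conditional probabilities in its eigenvectors, and fits the coefficients by penalized maximum likelihood. The eigenvalue-weighted L2 penalty, the function names, and the use of SciPy's L-BFGS-B optimizer are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.optimize import minimize


def smoothed_distribution(counts, W, lam=1.0):
    """Estimate P(symbol | context) from raw next-symbol counts, assuming the
    log-conditional probability is a smooth function over the similarity
    graph W (symmetric, nonnegative, n x n).  Hypothetical sketch only."""
    n = len(counts)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    evals, U = eigh(L)  # eigenvectors of the similarity graph, smoothest first

    def objective(alpha):
        logits = U @ alpha                    # expansion in the eigenbasis
        log_z = np.logaddexp.reduce(logits)   # log partition function
        nll = -counts @ (logits - log_z)      # multinomial negative log-likelihood
        # Assumed penalty: coefficients of "rough" eigenvectors (large
        # eigenvalues) are penalized more strongly.
        return nll + lam * np.sum(evals * alpha ** 2)

    alpha = minimize(objective, np.zeros(n), method="L-BFGS-B").x
    logits = U @ alpha
    return np.exp(logits - np.logaddexp.reduce(logits))


# Toy usage: a 4-symbol alphabet where symbols 0 and 1 are similar, as are 2 and 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
counts = np.array([5.0, 0.0, 1.0, 0.0])  # symbol 1 unseen, but similar to symbol 0
print(smoothed_distribution(counts, W, lam=0.5))
```

In this toy example the unseen symbol 1 receives probability mass leaked from its similar neighbor, symbol 0, which is the smoothing effect the similarity graph is meant to provide.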
