Exploiting semantic associative information in topic modeling

Topic modeling has been widely applied to text modeling tasks and to speech recognition systems because it effectively captures the semantic and statistical information in documents or speech utterances. Most topic models rely on the bag-of-words assumption, so the learned latent topics are lists of individual words. Such words may convey topical information, but they lack accurate semantic knowledge of the text. In this paper, we present the semantic associative topic model (SATM), which extends the concept of semantic association terms to topic modeling. The model guides the modeling of semantic associations among single words by expressing a document as associations of multiple words. The pointwise KL-divergence metric is used to measure the significance of each association. We also integrate the original PLSA and SATM models, yielding mixed feature representations. Experimental results on the WSJ and AP datasets show that the proposed approaches achieve higher performance than the other methods compared.
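To make the association-scoring step concrete, the following is a minimal sketch of one common form of pointwise KL divergence applied to word pairs: score(a, b) = p(a, b) log(p(a, b) / (p(a) p(b))). The abstract does not give the exact definition used in the paper, so the formula, the document-level co-occurrence counting, the function name `pointwise_kl_scores`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pointwise_kl_scores(documents):
    """Score unordered word pairs by p(a, b) * log(p(a, b) / (p(a) * p(b))).

    Probabilities are estimated as document frequencies (an assumption):
    p(a) is the fraction of documents containing a, and p(a, b) is the
    fraction of documents containing both a and b.
    """
    n_docs = len(documents)
    word_df = Counter()
    pair_df = Counter()
    for doc in documents:
        vocab = sorted(set(doc))  # sorting gives each pair a canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (a, b), count in pair_df.items():
        p_ab = count / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        scores[(a, b)] = p_ab * math.log(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy corpus, not data from the paper.
docs = [
    ["stock", "market", "trade"],
    ["stock", "market", "bank"],
    ["court", "judge", "trade"],
    ["court", "judge", "bank"],
]
scores = pointwise_kl_scores(docs)
```

Under these assumptions, pairs that co-occur more often than independence would predict (such as "market" and "stock" in the toy corpus) receive a positive score, while pairs that co-occur at the chance rate score zero. A threshold on the score could then select which associations to keep as features.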
