Paper information - Release ‘Bag-of-Words’ Assumption of Latent Dirichlet Allocation

Release ‘Bag-of-Words’ Assumption of Latent Dirichlet Allocation

Topic models such as latent Dirichlet allocation (LDA) are built on a vector-based representation of documents under the ‘bag-of-words’ assumption. They discover the distribution of underlying topics in a document and the distribution of keywords in a topic, and have recently proved successful and practical in many scenarios. Compared with a vector-based representation, a graph-based representation preserves more of a document's semantics, because it captures not only the keywords but also the relations between them. This paper proposes a topic model for graph-represented documents (GTM), in which a Bernoulli distribution models the formation of an edge between two keywords in a document. Experimental results show that GTM outperforms LDA on a document classification task in which the topics uncovered by each model are used to represent the documents.
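
To make the contrast between the two representations concrete, here is a minimal Python sketch. It is not the paper's GTM: the abstract does not say how edges are built or how the Bernoulli parameter is defined, so the sliding-window edge rule, the `window` parameter, and the `edge_prob` function below are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def bag_of_words(tokens):
    """Vector-style representation: keyword counts only, word order discarded."""
    return Counter(tokens)


def keyword_graph(tokens, window=2):
    """Graph-style representation: keywords are nodes, and an undirected edge
    links two distinct keywords that fall inside the same run of `window`
    consecutive tokens (a hypothetical edge rule, not taken from the paper)."""
    edges = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                edges.add(frozenset((left, right)))
    return set(tokens), edges


def edge_log_likelihood(nodes, edges, edge_prob):
    """Bernoulli log-likelihood of the observed edges: every unordered keyword
    pair is treated as an independent trial that forms an edge with
    probability edge_prob(u, v). `edge_prob` is an assumed, caller-supplied
    function that must return values strictly between 0 and 1."""
    total = 0.0
    for u, v in combinations(sorted(nodes), 2):
        p = edge_prob(u, v)
        present = frozenset((u, v)) in edges
        total += math.log(p if present else 1.0 - p)
    return total


# Two toy documents with the same keywords in a different order.
doc_a = ["topic", "model", "graph", "document"]
doc_b = ["graph", "topic", "document", "model"]

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
nodes_a, edges_a = keyword_graph(doc_a)
nodes_b, edges_b = keyword_graph(doc_b)
same_graph = edges_a == edges_b

# Uninformative edge model: every pair forms an edge with probability 0.5.
log_lik_a = edge_log_likelihood(nodes_a, edges_a, lambda u, v: 0.5)
```

The two toy documents are indistinguishable as bags of words but yield different edge sets, which is the extra information a graph-based model can use. In a topic model, `edge_prob` would presumably depend on the latent topics of the two keywords; that dependence is left unspecified here.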
