An Improved LDA Model for Academic Document Analysis

Electronic documents on the Internet are often accompanied by many kinds of side information. Although this wealth of information makes analysis more difficult, models can fit and analyze the data better if they make full use of it. Building on the study of probabilistic topic models, this paper proposes an improved LDA model suited to the analysis of academic documents. By modifying the standard LDA model, the improved model can analyze documents together with their authors and references. To evaluate its generalization capability, the paper compares the new model with standard LDA and the DMR model on the widely used Rexa dataset. Experimental results show that the new model has a higher capability for document clustering and topic extraction than standard LDA and its modifications. In addition, the new model outperforms the DMR model on the task of author discrimination.
