Cross-Document Knowledge Discovery Using Semantic Concept Topic Model

Topic models employ the Bag-of-Words (BOW) representation, which breaks terms into their constituent words and treats words as surface strings, assuming no predefined knowledge of word meaning. In this paper, we propose approaches based on Semantic Concept Latent Dirichlet Allocation (SCLDA) and Semantic Concept Hierarchical Dirichlet Process (SCHDP) that represent text as meaningful concepts rather than words, using a new model known as Bag-of-Concepts (BOC). We propose new algorithms for applying SCLDA and SCHDP to the Concept Chain Queries (CCQ) problem. The algorithms focus on discovering new semantic relationships between two concepts across documents, where the relationships found reveal semantic paths linking the two concepts across multiple text units. Experiments demonstrate that search quality is greatly improved compared with other LDA- or HDP-based approaches.
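To make the contrast between BOW and BOC concrete, and to illustrate what a concept chain across documents looks like, the following is a minimal Python sketch. It is not the SCLDA or SCHDP algorithm: the concept lexicon `CONCEPTS`, the greedy longest-match extractor, and the breadth-first search over a document-level co-occurrence graph are all assumptions introduced purely for illustration, standing in for the topic-model-based linking described above.

```python
from collections import Counter, deque

# Hypothetical concept lexicon (an assumption, not from the paper):
# token sequences mapped to concept identifiers.
CONCEPTS = {
    ("latent", "dirichlet", "allocation"): "Latent_Dirichlet_Allocation",
    ("topic", "model"): "Topic_Model",
    ("knowledge", "discovery"): "Knowledge_Discovery",
}
MAX_LEN = max(len(key) for key in CONCEPTS)


def bag_of_words(text):
    """BOW: every whitespace-separated token is counted on its own."""
    return Counter(text.lower().split())


def bag_of_concepts(text):
    """BOC: greedy longest-match lookup of multi-word concepts."""
    tokens = text.lower().split()
    bag = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            concept = CONCEPTS.get(tuple(tokens[i:i + n]))
            if concept:
                bag[concept] += 1
                i += n
                break
        else:
            i += 1  # no known concept starts at this token
    return bag


def concept_chain(docs, start, goal):
    """Shortest path between two concepts in a co-occurrence graph.

    Two concepts are linked if they appear in the same document.
    This simple BFS replaces the topic-model-based linking and ranking.
    """
    graph = {}
    for doc in docs:
        found = set(bag_of_concepts(doc))
        for concept in found:
            graph.setdefault(concept, set()).update(found - {concept})

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # the two concepts are not connected


if __name__ == "__main__":
    docs = [
        "latent dirichlet allocation is a topic model",
        "a topic model supports knowledge discovery",
    ]
    print(bag_of_words(docs[0]))
    print(bag_of_concepts(docs[0]))
    print(concept_chain(docs, "Latent_Dirichlet_Allocation", "Knowledge_Discovery"))
```

In this toy setting the two end concepts never share a document, so any chain between them must pass through an intermediate concept found in both texts. A topic-model-based approach would instead weigh candidate links by their topical relatedness rather than by raw co-occurrence.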
