Leveraging Unstructured Information Using Topic Modelling

Unstructured information in the form of natural language text is abundant in many kinds of organisations. To improve information sharing, organisational learning, decision-making and productivity, large amounts of unstructured text need to be analysed on a daily basis. Full-text search alone is not sufficient as a first step in helping users understand what a collection of electronic documents is about, because it gives no overview of the underlying concepts in the collection. A topic model is a useful mechanism for identifying and characterising the concepts embedded in a document collection, allowing the user to navigate the collection in a topic-guided manner. Topics, each made up of significant words, give the user an overview of the content of the collection. Each document is represented as a mixture of automatically constructed topics, so the user may select the documents related to a topic of interest or, conversely, the topics related to a document of interest. Similar documents can be found by looking at which documents are assigned to the same topic, enabling the user to find other documents related to a given document. This methodology lets users digest a larger number of documents, so that they spend more of their time reading and less of it searching for relevant information.
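The topic-guided navigation described above can be illustrated with a minimal sketch. The abstract does not name a specific topic model, so the sketch assumes latent Dirichlet allocation fitted by collapsed Gibbs sampling, written against the Python standard library only. The function names (`fit_lda`, `documents_for_topic`, `related_documents`), the hyperparameter values and the toy corpus are all hypothetical and chosen purely for illustration.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0, top_n=3):
    """Fit an LDA-style topic model with collapsed Gibbs sampling (illustrative sketch).

    docs is a list of tokenised documents (lists of words). Returns
    (theta, top_words): theta[d][k] is the weight of topic k in document d,
    top_words[k] lists the most frequent words assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                  # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]    # times word v is assigned to topic k
    topic_total = [0] * num_topics                                # total words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Resample each word's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic mixtures; each row sums to 1.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [vocab[v] for v in sorted(range(vocab_size), key=lambda v: -topic_word[k][v])[:top_n]]
        for k in range(num_topics)
    ]
    return theta, top_words


def dominant_topic(mixture):
    """Index of the topic with the largest weight in one document's mixture."""
    return max(range(len(mixture)), key=lambda k: mixture[k])


def documents_for_topic(theta, topic, min_weight=0.5):
    """Indices of documents whose mixture gives at least min_weight to the topic."""
    return [d for d, mixture in enumerate(theta) if mixture[topic] >= min_weight]


def related_documents(theta, doc_index):
    """Indices of other documents that share the given document's dominant topic."""
    topic = dominant_topic(theta[doc_index])
    return [
        d for d, mixture in enumerate(theta)
        if d != doc_index and dominant_topic(mixture) == topic
    ]


# Hypothetical toy corpus, already tokenised.
docs = [
    "budget forecast revenue budget cost".split(),
    "server network outage server patch".split(),
    "revenue cost forecast quarter".split(),
    "network patch firewall outage".split(),
]
theta, top_words = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, words, "documents:", documents_for_topic(theta, k))
print("related to document 0:", related_documents(theta, 0))
```

The rows of `theta` are the per-document topic mixtures, `top_words` plays the role of the significant words that summarise each topic, and the two helper functions correspond to selecting documents by topic and finding documents related to a given one. Gibbs sampling is stochastic, so the topics found on a corpus this small depend on the seed and the hyperparameters.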
