The Algorithm of Modelling and Analysis of Latent Semantic Relations: Linear Algebra vs. Probabilistic Topic Models

This paper presents an algorithm for modelling and analysing Latent Semantic Relations within a collection of argumentative documents. The novelty of the algorithm lies in its systematic approach: it combines the probabilistic Latent Dirichlet Allocation (LDA) method with the linear-algebra-based Latent Semantic Analysis (LSA) method, and it treats each document as a complex of topics identified by analysing its individual paragraphs separately. The algorithm comprises the following stages: modelling and analysis of Latent Semantic Relations, carried out successively at the LDA-based and LSA-based levels; and rule-based adjustment of the results of the two levels of analysis. The proposed algorithm was verified on corpora of subjectively positive and negative Polish-language film reviews. The recall and precision obtained in the case study support conclusions about the effectiveness of the proposed algorithm.
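As an illustration of the two evaluation measures named above, the following minimal sketch computes precision and recall for a binary review classifier. The function name `precision_recall`, the label strings, and the six example labels are hypothetical and are not taken from the paper; the paper's own evaluation procedure may differ.

```python
def precision_recall(predicted, actual, positive="positive"):
    """Return (precision, recall) for the class named by `positive`.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    A measure is reported as 0.0 when its denominator is zero.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical ground-truth and predicted labels for six reviews
# (illustrative data only, not results from the paper).
actual = ["positive", "positive", "positive", "negative", "negative", "negative"]
predicted = ["positive", "positive", "negative", "positive", "negative", "negative"]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy data, two of the three reviews predicted as positive are truly positive, and two of the three truly positive reviews are recovered, so both measures equal 2/3.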
