Topic modeling has long been used to explore the content of documents in a dataset by searching for the hidden topics within them. Over the years, many algorithms have been developed on this basis, with major approaches built on the “bag-of-words” and vector space models. These approaches mainly count how often words related to a document's topic occur and extract the topic model from those frequencies. However, they often ignore sentence structure, in particular word order, which affects the meaning of the document. In this paper, we propose a new approach to discovering the hidden topics of documents in a dataset: each document is represented as a co-occurrence graph, and a frequent subgraph mining algorithm is then applied to model the topics. Our goal is to account for the effect of word order on the meaning and topic of a document. Furthermore, we implemented this model on a large-scale distributed data processing system to speed up the computationally expensive graph operations, which otherwise take a long time to execute.
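To illustrate why a co-occurrence graph can retain word-order information that a bag-of-words representation loses, here is a minimal sketch. It is not the paper's implementation. The function name `cooccurrence_edges`, the sliding-window size, the use of directed edges, and the two toy documents are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_edges(tokens, window=2):
    """Count directed edges (earlier_word, later_word) inside a sliding window.

    Hypothetical helper: the window size and the edge direction are
    illustrative assumptions, not details taken from the paper.
    """
    edges = Counter()
    for i, source in enumerate(tokens):
        # Link each word to the words that follow it within the window.
        for target in tokens[i + 1 : i + window]:
            if source != target:
                edges[(source, target)] += 1
    return edges


# Two toy documents with the same words in a different order.
doc_a = "graph mining finds topics".split()
doc_b = "topics finds mining graph".split()

bag_a, bag_b = Counter(doc_a), Counter(doc_b)
edges_a, edges_b = cooccurrence_edges(doc_a), cooccurrence_edges(doc_b)

print(bag_a == bag_b)      # bag-of-words cannot tell the documents apart
print(edges_a == edges_b)  # the directed co-occurrence graphs differ
```

In a pipeline of the kind described above, edge sets like these would be built per document and then passed to a frequent subgraph miner. That step is omitted here.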