Fuzzy-Gibbs latent Dirichlet allocation model for feature extraction on Indonesian documents

Putu Manik Prihatini et al.

Latent Dirichlet Allocation is a topic-based feature extraction method that uses inference to find semantic relationships in a corpus. Although Latent Dirichlet Allocation is very powerful in handling very large data sets, it has very high complexity: the number of iterations needed to reach convergence grows with the number of documents. Latent Dirichlet Allocation generates a probability for every topic in a document, and these probabilities carry uncertainty, so the relationship between that uncertainty and the number of iterations needs to be analyzed. In this paper, Latent Dirichlet Allocation is modified by adding fuzzy logic to the Gibbs sampling inference algorithm. The purpose is to analyze the effect of fuzzy logic in handling the uncertainty of the occurrence of all topics in a document, which affects the number of iterations required for inference. The Fuzzy-Gibbs Latent Dirichlet Allocation algorithm is implemented on text data from Indonesian documents. Testing is performed on data sets of three different sizes to determine the effect of the number of documents on the number of iterations. The algorithm's performance is also measured using Perplexity, Precision, Recall and F-Measure. The test results show that the Fuzzy-Gibbs Latent Dirichlet Allocation algorithm reaches convergence in fewer iterations and performs better than the Gibbs Sampling Latent Dirichlet Allocation algorithm.
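For orientation, the baseline that the abstract compares against can be sketched as a standard collapsed Gibbs sampler for Latent Dirichlet Allocation, together with the evaluation measures it names. This is a minimal sketch under stated assumptions, not the authors' Fuzzy-Gibbs algorithm: the fuzzy-logic step is not described in the abstract and is not reproduced here. The function names, the hyperparameter values (alpha, beta), the toy corpus and the fixed iteration count are all illustrative assumptions, and perplexity is computed on the training documents for simplicity.

```python
import math
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Standard collapsed Gibbs sampling for LDA (baseline, no fuzzy logic).

    docs is a list of documents, each a list of integer word ids.
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]               # topic counts per document
    nkw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    nk = [0] * n_topics                                # total tokens per topic
    z = []                                             # topic assignment of every token

    # Random initialisation of topic assignments.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)

    # Fixed number of sweeps; a convergence check (assumed, not shown) could stop earlier.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Full conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    theta = [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(nkw[t][w] + beta) / (nk[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi


def perplexity(docs, theta, phi):
    """exp(-average log-likelihood per token); lower is better."""
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][t] * phi[t][w] for t in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-measure from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy corpus (hypothetical): 4 documents over a 5-word vocabulary.
toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 1, 1, 0], [4, 3, 3, 4]]
theta, phi = gibbs_lda(toy_docs, n_topics=2, vocab_size=5)
print("perplexity:", perplexity(toy_docs, theta, phi))
```

Each sweep visits every token in every document, so the cost of one iteration grows with corpus size; this is why reducing the number of iterations to convergence, as the abstract reports, matters for large collections.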
