Feature extraction for document text using Latent Dirichlet Allocation

Feature extraction is one of the stages of an information retrieval system; it extracts the distinctive feature values of a text document. Feature extraction can be carried out with several methods, one of which is Latent Dirichlet Allocation (LDA). However, research on text feature extraction using LDA is rarely found for Indonesian text. This research therefore implements LDA-based text feature extraction for Indonesian text. The research method consists of data acquisition, text pre-processing, initialization, topic sampling, and evaluation. The evaluation compares the Precision, Recall, and F-Measure values of LDA with those of Term Frequency-Inverse Document Frequency (TF-IDF) with K-Means, which is commonly used for feature extraction. The results show that LDA achieves higher Precision, Recall, and F-Measure values than TF-IDF K-Means. This indicates that LDA extracts features and clusters Indonesian text better than TF-IDF K-Means.
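
The initialization, topic sampling, and evaluation stages named above can be illustrated with a minimal sketch. It assumes that topic sampling is done by collapsed Gibbs sampling, which the abstract does not confirm, and that documents have already been pre-processed into lists of integer word ids. The function names (`lda_gibbs`, `precision_recall_f_measure`), the hyperparameter defaults (`alpha`, `beta`, `iterations`), and the toy corpus are hypothetical and are not taken from the paper.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). Returns (theta, phi): per-document topic
    proportions and per-topic word distributions.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]       # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                     # n(k)
    assignments = []

    # Initialization: give every token a random topic and fill the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Topic sampling: resample each token's topic given all other tokens.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimates; each row of theta and phi sums to 1.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return theta, phi


def precision_recall_f_measure(tp, fp, fn):
    """Precision, Recall, and F-Measure from true/false positive counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy corpus of word ids (hypothetical data, for illustration only).
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1]]
theta, phi = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
```

In this sketch each row of `theta` serves as the extracted feature vector of a document; assigning a document to its highest-probability topic is one simple way such features could be used for clustering before computing Precision, Recall, and F-Measure against labeled categories.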
