Evaluation of knowledge acquisition from document clustering based on information retrieval scales

Twitter is becoming one of the most important social sensors for observing the reputation and trends of real-world events and things. Information available on Twitter also shapes the public's impression of enterprises and is effective in influencing opinions. In this study, we attempted to cluster companies using tweets containing the hashtags corresponding to each company, drawn from the company-related language resources accumulated on Twitter. However, the number of tweets differs from company to company, which may affect clustering performance. We therefore compared TF-IDF, a conventional weighting method, with BM25, which takes document length into account, to determine whether the two differ in company clustering performance. The collected tweets were weighted by each information retrieval measure, and the clustering results were evaluated by entropy. The results showed that BM25 and its related methods were effective in document clustering.
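The comparison described above can be illustrated with a minimal sketch. It is not the implementation used in this study. It assumes that each company is represented as one tokenized document (a list of terms), that both measures share a simple log(N/df) inverse document frequency, that BM25 uses the commonly cited defaults k1 = 1.2 and b = 0.75, and that entropy is computed against hypothetical reference labels as the size-weighted average of per-cluster label entropy. All function and parameter names are illustrative.

```python
import math
from collections import Counter


def idf(docs):
    """Inverse document frequency log(N / df); a simplifying assumption."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_weights(docs):
    """Raw term frequency times IDF, with no document-length normalization."""
    idf_ = idf(docs)
    return [{t: tf * idf_[t] for t, tf in Counter(doc).items()} for doc in docs]


def bm25_weights(docs, k1=1.2, b=0.75):
    """BM25-style term weights; k1 and b are assumed defaults."""
    idf_ = idf(docs)
    avgdl = sum(len(doc) for doc in docs) / len(docs)
    weighted = []
    for doc in docs:
        # Longer-than-average documents get a larger normalizer,
        # which dampens their term weights.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        weighted.append(
            {t: idf_[t] * tf * (k1 + 1) / (tf + norm)
             for t, tf in Counter(doc).items()}
        )
    return weighted


def clustering_entropy(clusters, labels):
    """Size-weighted average of per-cluster label entropy (lower is better)."""
    n = len(labels)
    members = {}
    for cluster, label in zip(clusters, labels):
        members.setdefault(cluster, []).append(label)
    total = 0.0
    for group in members.values():
        counts = Counter(group)
        h = -sum((c / len(group)) * math.log2(c / len(group))
                 for c in counts.values())
        total += len(group) / n * h
    return total
```

In this sketch the only difference between the two weightings is the saturation and length normalization in `bm25_weights`, which is the property that matters when companies have very different numbers of tweets.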