Grouping of Customer Opinions Written in Natural Language Using Unsupervised Machine Learning

Automatic categorization is one of the most topical tasks in the processing of textual documents. Clustering, the most common form of unsupervised learning, enables automatic grouping of unlabeled documents into subsets called clusters. This paper examines the results of clustering very large real-world electronic data collections of customer reviews written freely in natural language (English). The reviews are automatically clustered into two groups that should contain either positive or negative reviews. The paper focuses on analyzing why certain reviews are wrongly assigned to a group containing mostly reviews of the other class. A review is assigned to a cluster on the basis of its properties, i.e., the words that appear in it. The words appearing in incorrectly categorized reviews were therefore analyzed. It was found that words that are important for correct classification (and thus bear some sentiment) are often similarly important in a different set than expected, and therefore do not act as misleading information, unlike words that are much more or quite insignificant.
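
To make the setting concrete, the following is a minimal sketch of grouping reviews into two clusters using only the words they contain. It is an illustration under stated assumptions, not the procedure used in the paper: the bag-of-words representation, the cosine similarity measure, the k-means-style loop, the seed choice, and all names (bag_of_words, cosine, two_means, the four toy reviews) are hypothetical and chosen for brevity.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_means(reviews, seeds=(0, 1), iterations=10):
    """Split reviews into two clusters (labels 0 and 1).

    Assumption: the two seed reviews serve as initial cluster centers.
    Ties go to cluster 0.
    """
    vectors = [bag_of_words(r) for r in reviews]
    centers = [vectors[seeds[0]], vectors[seeds[1]]]
    labels = []
    for _ in range(iterations):
        # Assign each review to the more similar center.
        new_labels = [
            0 if cosine(v, centers[0]) >= cosine(v, centers[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each center as the summed word counts of its members
        # (cosine similarity ignores the overall scale of the vector).
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                total = Counter()
                for v in members:
                    total.update(v)
                centers[k] = total
    return labels


# Toy data, invented for illustration only.
reviews = [
    "great hotel friendly staff",
    "terrible hotel rude staff",
    "great location friendly service",
    "terrible room rude service",
]
labels = two_means(reviews)
```

In this toy example, words such as "hotel", "staff", and "service" occur in both groups, while "great" and "terrible" occur in only one; how strongly each word pulls a review toward a cluster is the kind of effect the analysis of incorrectly categorized reviews is concerned with.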