Identification of Generalized Communities with Semantics in Networks with Content

Discovery of communities in networks is a fundamental data analysis task. Recently, researchers have tried to improve its performance by exploiting node contents, and to interpret the communities using the semantics derived from those contents. However, existing methods typically assume that the communities are assortative (i.e., members of each group are mostly connected to other members of the same group), and so cannot find generalized community structure, e.g., structures with assortative communities, disassortative communities (i.e., vertices of the same group have most of their connections outside their group), or a combination of the two. In addition, these methods often assume that the network topology and the node contents share the same group memberships, and thus perform poorly when the contents do not match the network structure. They are also limited to using only one topic to interpret each community. To address these issues, we propose a new generative probabilistic model, which is learned using a nested expectation-maximization algorithm. It describes the generalized communities (based on network topology) and the content clusters (based on node contents) separately, and models the correlation between them so that each can be used to improve the other. By modeling and exploiting this correlation, our model is not only robust with respect to the above problems, but is also able to interpret each community using more than one topic, which provides richer explanations. We validate the robustness of the proposed approach on an artificial benchmark, and test its interpretability using a case study. Finally, we show that it outperforms seven state-of-the-art algorithms for community detection on eight real networks.
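
The distinction between assortative and disassortative structure can be made concrete with a small illustration. The sketch below is not the proposed model; it only computes, for two hypothetical toy graphs, the fraction of edges that stay inside a group. The function name `within_group_fraction`, the node labels, and the edge lists are all assumptions introduced for illustration.

```python
def within_group_fraction(edges, group):
    """Return the fraction of edges whose two endpoints share a group label."""
    same = sum(1 for u, v in edges if group[u] == group[v])
    return same / len(edges)


# Hypothetical toy data: nodes 0-2 belong to group "a", nodes 3-5 to group "b".
group = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

# Assortative example: two triangles joined by a single bridge edge (2, 3),
# so most edges connect members of the same group.
assortative_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

# Disassortative example: every edge crosses from group "a" to group "b",
# so vertices have all of their connections outside their own group.
disassortative_edges = [(0, 3), (0, 4), (1, 4), (1, 5), (2, 3), (2, 5)]

assortative_score = within_group_fraction(assortative_edges, group)
disassortative_score = within_group_fraction(disassortative_edges, group)

print(f"assortative:    {assortative_score:.3f}")
print(f"disassortative: {disassortative_score:.3f}")
```

A fraction close to 1 indicates assortative groups and a fraction close to 0 indicates disassortative groups. A method that assumes assortativity would tend to miss the grouping in the second graph, which is the limitation the generalized community structure is meant to address.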