The notion of similarity is crucial to many tasks and methods in machine learning and data mining, including clustering and nearest neighbor classification. In many contexts, a natural (but not necessarily optimal) similarity measure is defined on the objects to be clustered or classified, and there is also information about which objects are linked together. This raises the question of to what extent the information contained in the links can be used to obtain a more relevant similarity measure. Earlier research has shown empirically that including such link information can yield more accurate results, but it did not analyze why this is the case. In this paper we provide such an analysis. We relate the extent to which improved results can be obtained to three notions: homophily in the network, transitivity of similarity, and content variability of objects. We explore this relationship using randomly generated datasets in which we vary the amount of homophily and content variability. The results show that, within a fairly wide range of values for these parameters, including link information in the similarity measure indeed yields better results than computing the similarity of objects directly from their content.
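To make the idea of including link information in a similarity measure concrete, the following is a minimal sketch and not necessarily the measure analyzed in the paper. It assumes that content is represented as numeric vectors, that content similarity is cosine similarity, and that link information enters by averaging content similarity over linked objects. The function names (`cosine`, `neighbor_similarity`, `hybrid_similarity`) and the mixing weight `alpha` are hypothetical and introduced only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length content vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def neighbor_similarity(u, v, content, links):
    """Mean content similarity between u and the objects linked to v.

    Falls back to plain content similarity when v has no links.
    """
    nbrs = links.get(v, [])
    if not nbrs:
        return cosine(content[u], content[v])
    return sum(cosine(content[u], content[w]) for w in nbrs) / len(nbrs)


def hybrid_similarity(u, v, content, links, alpha=0.5):
    """Blend content similarity with a symmetrized link-based similarity.

    alpha is an assumed mixing weight: 1.0 uses content only,
    0.0 uses link information only.
    """
    s_content = cosine(content[u], content[v])
    s_links = 0.5 * (
        neighbor_similarity(u, v, content, links)
        + neighbor_similarity(v, u, content, links)
    )
    return alpha * s_content + (1 - alpha) * s_links


# Toy data: "a" and "b" have no content overlap (high content variability),
# but both are linked to "c", whose content overlaps with each of them.
content = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
links = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}

content_only = cosine(content["a"], content["b"])
with_links = hybrid_similarity("a", "b", content, links)
print(content_only, with_links)
```

In this toy example the content-only similarity of `a` and `b` is zero, while the blended measure is positive because their shared neighbor `c` resembles both. Whether such a blend helps in practice depends on the factors named above: how homophilous the links are and how variable the content is.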