Improving the Accuracy of Similarity Measures by Using Link Information

The notion of similarity is crucial to a number of tasks and methods in machine learning and data mining, including clustering and nearest neighbor classification. In many contexts there is, on the one hand, a natural (but not necessarily optimal) similarity measure defined on the objects to be clustered or classified and, on the other hand, information about which objects are linked together. This raises the question of to what extent the information contained in the links can be used to obtain a more relevant similarity measure. Earlier research has shown empirically that more accurate results can be obtained by including such link information, but why this is the case has not been analyzed. In this paper we provide such an analysis. We relate the extent to which improved results can be obtained to the notions of homophily in the network, transitivity of similarity, and content variability of objects. We explore this relationship using randomly generated datasets in which we vary the amount of homophily and content variability. The results show that, within a fairly wide range of values for these parameters, including link information in the similarity measure indeed yields improved results compared to computing the similarity of objects directly from their content.
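To make the idea concrete, the following is a minimal sketch of one way link information could be folded into a content-based similarity. It is an illustrative assumption and not necessarily the measure analyzed in this paper: the function name `link_aware_similarity`, the mixing weight `alpha`, and the toy data are all hypothetical. The sketch blends the direct cosine similarity of two objects with the average similarity between each object and the other object's linked neighbors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_aware_similarity(x, y, content, neighbors, alpha=0.5):
    """Hypothetical hybrid measure (an assumption for illustration).

    Blends the direct content similarity of x and y with the mean
    similarity over pairs (neighbor of x, y) and (x, neighbor of y).
    Falls back to the direct similarity when neither object has links.
    """
    direct = cosine(content[x], content[y])
    pairs = [(n, y) for n in neighbors.get(x, ())]
    pairs += [(x, m) for m in neighbors.get(y, ())]
    if not pairs:
        return direct
    via_links = sum(cosine(content[a], content[b]) for a, b in pairs) / len(pairs)
    return alpha * direct + (1 - alpha) * via_links


# Toy data: "b" has atypical content but is linked to "c",
# whose content matches that of "a".
content = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
neighbors = {"b": ["c"], "c": ["b"]}

print(cosine(content["a"], content["b"]))                     # 0.0
print(link_aware_similarity("a", "b", content, neighbors))    # 0.5
```

In this toy setting the content of "b" alone suggests no similarity to "a", whereas the link from "b" to "c" raises the blended score. Whether such a correction helps in practice would depend on how strongly links connect similar objects (homophily) and on how much the content of individual objects varies.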