Paper information - On the importance of similarity measures for planning to learn

Data analysis is a complex process that consists of finding a suitable data representation, choosing a suitable machine learning method, and using a suitable evaluation metric (one that reflects what the user is really interested in). All these choices are crucial from the “planning to learn” perspective, and none is trivial. In this paper we focus on the first of the three, the input space representation. Sometimes this problem is posed as “defining the right features”, but when the data are non-standard, for instance relational or graph data, the representation problem does not map easily onto feature construction. In some sense, it is easier to see it as a problem of constructing a suitable distance metric, similarity measure, or kernel. In this paper we discuss this view in more detail. We then illustrate it by looking at input data represented as annotated graphs and defining a few similarity measures in this context. Finally, we show the importance of the choice of distance measure with an experiment on the Cora dataset.
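To make the idea of similarity measures on annotated graphs concrete, the following is a minimal sketch and not the measures defined in the paper. It assumes each node carries a sparse annotation vector (for example word weights) and a set of neighbouring nodes. The function names `cosine`, `jaccard`, and `hybrid`, the mixing weight `alpha`, and the toy graph are all hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def jaccard(a, b):
    """Jaccard overlap between two neighbour sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hybrid(x, y, annotations, neighbours, alpha=0.5):
    """Illustrative mix of content and structure similarity.

    alpha=1.0 uses only the node annotations, alpha=0.0 only the graph
    structure. This weighting is an assumption made for this sketch.
    """
    content = cosine(annotations[x], annotations[y])
    structure = jaccard(neighbours[x], neighbours[y])
    return alpha * content + (1.0 - alpha) * structure

# Hypothetical toy graph: p1 and p2 share content but no neighbours,
# while p1 and p3 share a neighbour but no content.
annotations = {
    "p1": {"kernel": 1.0, "graph": 1.0},
    "p2": {"kernel": 1.0, "graph": 1.0},
    "p3": {"logic": 1.0, "rule": 1.0},
    "p4": {"genetic": 1.0},
}
neighbours = {
    "p1": {"p4"},
    "p2": set(),
    "p3": {"p4"},
    "p4": {"p1", "p3"},
}

def nearest(x, alpha):
    """Return the node most similar to x under the hybrid measure."""
    others = [n for n in annotations if n != x]
    return max(others, key=lambda y: hybrid(x, y, annotations, neighbours, alpha))
```

On this toy graph, `nearest("p1", alpha=1.0)` picks `p2` while `nearest("p1", alpha=0.0)` picks `p3`, so the same data yield different nearest neighbours depending on the measure. This is the kind of sensitivity to the choice of measure that the abstract refers to.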

Hendrik Blockeel | Hossein Rahmani | Tijn Witsenburg

[1] Andrew McCallum, et al. Automating the Construction of Internet Portals with Machine Learning, 2000, Information Retrieval.

[2] Hendrik Blockeel, et al. A method to extend existing document clustering procedures in order to include relational information, 2008, MLG 2008.