On the importance of similarity measures for planning to learn

Data analysis is a complex process that consists of finding a suitable data representation, choosing a suitable machine learning method, and using a suitable evaluation metric (one that reflects what the user is really interested in). All these choices are crucial from the “planning to learn” perspective, and none of them is trivial. In this paper we focus on the first of the three, the input space representation. This problem is sometimes posed as “defining the right features”, but for non-standard data, such as relational or graph data, the data representation problem does not map easily onto feature construction. In such cases it is often easier to view it as the problem of constructing a suitable distance metric, similarity measure, or kernel. We discuss this view in more detail and then illustrate it on input data represented as annotated graphs, for which we define a few similarity measures. Finally, we illustrate the importance of the choice of similarity measure with an experiment on the Cora dataset.