Semi-supervised active learning in graphical domains

In a traditional machine learning task, the goal is to train a classifier using only labeled data (feature/label pairs) so that it can generalize to completely new data, which will then be labeled by the classifier. Unfortunately, in many cases it is difficult, expensive, or time-consuming to obtain the labeled instances needed for training, also because a human supervisor is usually required to annotate large amounts of data in order to collect a significant training set. Moreover, in many cases we are not interested in generalizing to arbitrary unseen examples; we only need to discover labels for a large quantity of unlabeled, but already available, data by using a small subset of labeled data. If the given scenario involves both these conditions, a semi-supervised learning algorithm can be exploited as a solution to the classification problem. Semi-supervised learning algorithms combine a large amount of unlabeled data with a small available set of labeled data to build a reliable classifier.

It is particularly interesting to focus on a sub-class of semi-supervised learning algorithms, namely graph-based semi-supervised learning. In this framework we represent the data as a graph whose nodes are the labeled and unlabeled examples in the dataset, and whose edges are added according to a given similarity relationship between pairs of examples. A common feature of all graph-based methods is that they are nonparametric, discriminative, and transductive. However, a crucial issue is the very limited number of labeled (supervised) data points available with respect to the unlabeled points, so it is essential that these few labeled examples be representative. In some cases the labeled data points are given, but there are also many scenarios where we only have a set of unlabeled data points and can choose a limited number of them to build the labeled data set. In this paper we propose a graph-based semi-supervised active learning algorithm based on a reasonable choice of the labeled data points, in order to improve classification accuracy.

In a typical semi-supervised learning problem we consider a set of $n$ data points $X = \{x_1, x_2, \dots, x_n\}$ defined on a $d$-dimensional feature space, so that each point $x_i \in \mathbb{R}^d$, and a set of labels $Y = \{+1, -1\}$. We have an oracle function $h$ that returns the correct label for every data point it is applied to. So …
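To make the graph-based setting concrete, the sketch below is a minimal, illustrative Python example (not the algorithm proposed in this paper): it builds a k-nearest-neighbour similarity graph over the feature vectors $x_i \in \mathbb{R}^d$ with a Gaussian kernel, and then propagates the $\{+1, -1\}$ labels of a small labeled subset to the remaining nodes by iterative neighbourhood averaging. The choices of kernel width, number of neighbours, and iteration count are arbitrary placeholders.

```python
import numpy as np

def knn_similarity_graph(X, k=5, sigma=1.0):
    """Symmetric k-NN weight matrix with a Gaussian (RBF) kernel."""
    n = X.shape[0]
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)              # exclude self-loops
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]          # indices of the k nearest neighbours of x_i
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)                 # symmetrize the graph

def propagate_labels(W, y, labeled_idx, n_iter=200):
    """Spread labels over the graph, clamping the labeled (supervised) points."""
    n = W.shape[0]
    f = np.zeros(n)
    f[labeled_idx] = y[labeled_idx]           # labels in {+1, -1} on the labeled subset
    D = W.sum(axis=1)
    D[D == 0] = 1.0                           # guard against isolated nodes
    for _ in range(n_iter):
        f = (W @ f) / D                       # average the scores of neighbouring nodes
        f[labeled_idx] = y[labeled_idx]       # keep supervised labels fixed
    return np.sign(f)

# Toy usage: two Gaussian clusters, one labeled point per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
W = knn_similarity_graph(X, k=5)
pred = propagate_labels(W, y, labeled_idx=np.array([0, 20]))
```

In this sketch the two labeled indices play the role of the oracle-labeled points: which unlabeled points are worth querying from the oracle $h$ is exactly the active learning question addressed in this paper.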