Co-op Training: A Semi-Supervised Learning Method for Data Streams

Applying machine learning algorithms to data streams is a challenging task because traditional strategies assume datasets are labeled, finite, and stationary. In the context of data streams, where data is generated in real time and labels may be missing because of the high cost of the labeling process, semi-supervised learning (SSL) strategies, which learn from labeled and unlabeled data simultaneously, offer a viable, if still challenging, solution. In this paper, we present a novel approach to handling missing labels for classification learning in data streams, named co-op training, which combines incremental self-training and co-training. In a controlled experiment, we run the proposed algorithm, along with the most well-known semi-supervised learning strategies, on 11 artificial and real-world datasets and compare the results. We found our strategy to be more accurate than the other SSL algorithms on most datasets, and to achieve better run times when accuracies were similar. These methods are implemented in the Massive Online Analysis (MOA) open-source software as an internal benchmark component, to help researchers easily run experimental comparisons of semi-supervised learning on data streams.
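The abstract describes co-op training as a blend of incremental self-training and co-training. As a rough, hypothetical sketch of how those two ideas can be combined in a streaming (test-then-train) loop, consider the Python example below. This is not the authors' algorithm: the two-view feature split, the GaussianNB base learners, the 10% label availability, and the 0.9 confidence threshold are all illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

# Illustrative sketch only: incremental self-training combined with
# co-training on a simulated stream. The view split, base learners, and
# thresholds are assumptions, not the paper's co-op training algorithm.

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
classes = np.unique(y)

# Two "views": disjoint halves of the feature space, one learner per view.
views = [slice(0, 5), slice(5, 10)]
learners = [GaussianNB(), GaussianNB()]

# Warm-start both learners on a small labeled prefix of the stream.
warm = 50
for model, view in zip(learners, views):
    model.partial_fit(X[:warm, view], y[:warm], classes=classes)

correct = total = 0
for i in range(warm, len(X)):
    x = X[i:i + 1]
    labeled = rng.random() < 0.1  # assume only ~10% of labels arrive

    # Test-then-train: predict by averaging the two views' probabilities.
    probas = [m.predict_proba(x[:, v]) for m, v in zip(learners, views)]
    pred = classes[np.mean(probas, axis=0).argmax()]
    correct += int(pred == y[i])
    total += 1

    if labeled:
        # True label available: both learners take a supervised step.
        for model, view in zip(learners, views):
            model.partial_fit(x[:, view], [y[i]])
    else:
        # Co-training step: each learner teaches the other when its own
        # prediction is confident enough (self-labeled pseudo-label).
        for j, (model, view) in enumerate(zip(learners, views)):
            p = probas[j]
            if p.max() >= 0.9:
                pseudo = classes[p.argmax()]
                learners[1 - j].partial_fit(x[:, views[1 - j]], [pseudo])

print(f"prequential accuracy: {correct / total:.3f}")
```

In an actual stream, label availability is dictated by the environment rather than a coin flip, and in MOA the incremental `partial_fit`/`predict_proba` pair would correspond to the Java `Classifier` methods `trainOnInstance` and `getVotesForInstance`.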