Depuration, augmentation and balancing of training data for supervised learning based detectors of EEG patterns

The development of automatic detectors for EEG patterns is often challenged by the quality and availability of training events. We implemented data depuration, augmentation and balancing steps in the development process of a sleep-spindle detector and measured their effect on detection performance. The training data depuration is based on kernelized k-means clustering and allows training events to be re-grouped into the class with similar characteristics. The data augmentation exploits the multi-channel expression of EEG patterns. The data balancing adjusts the classes so that they have the same size. We worked with 27 EEG recordings, which were segmented into epochs of 250 ms; each epoch was then characterized by eight features. Two EEG recordings were used for training, six for validation and 19 for testing. The depuration of non-augmented, balanced data reclassified 47% of the epochs within visual positive marks and 7% of the epochs outside visual positive marks as belonging to the opposite class. For the detection of single epochs from the validation set, the detector trained with non-augmented, unbalanced, depurated data showed the highest area under the precision-recall curve and the highest Matthews correlation coefficient. For the detection of sleep spindles on the test set, the depuration of non-augmented training data increased the Matthews correlation coefficient by 64%, and using unbalanced data increased it by an additional 1.9%.
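
As an illustration only, the following minimal Python sketch shows three of the operations named above: segmenting a signal into 250 ms epochs, balancing two classes to equal size, and computing the Matthews correlation coefficient. The function names, the sampling rate used in the example, and the choice of random undersampling as the balancing strategy are assumptions made for this sketch; the text above does not specify how the balancing was implemented.

```python
import math
import random


def segment_epochs(signal, fs_hz, epoch_ms=250):
    """Split a 1-D signal into non-overlapping epochs of epoch_ms milliseconds.

    Trailing samples that do not fill a whole epoch are dropped.
    """
    n = int(round(fs_hz * epoch_ms / 1000))
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def balance_by_undersampling(labels, seed=0):
    """Return indices of a subset in which both classes have the same size.

    Assumption: random undersampling of the larger class. Other balancing
    strategies (e.g. oversampling) would be equally consistent with the text.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos), len(neg))
    return sorted(rng.sample(pos, k) + rng.sample(neg, k))


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    fs_hz = 200  # hypothetical sampling rate, not taken from the text
    epochs = segment_epochs(list(range(1000)), fs_hz)
    print(len(epochs), len(epochs[0]))

    labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    print(balance_by_undersampling(labels))

    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

The Matthews correlation coefficient uses all four cells of the confusion matrix, which makes it informative when positive epochs are much rarer than negative ones, as is typical for sleep-spindle detection.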