In this paper we report on the design of a pipeline that combines Common Spatial Patterns (CSP), a signal processing approach commonly used in electroencephalography (EEG), with a matrix representation of features and image classification in order to categorize videos taken by a humanoid robot. The ultimate goal is to endow the robot with action recognition capabilities for more natural social interaction. In summary, we apply the CSP algorithm to a set of signals obtained for each video by extracting the skeleton joints of the person performing the action. From the transformed signals a summary image is obtained for each video, and these images are then classified using two different approaches: global visual descriptors and convolutional neural networks. The presented approach has been tested on two data sets that represent two scenarios with common characteristics. The first is a data set of 46 individuals performing 6 different actions, for which OpenPose has been used to extract the skeleton joints of the person performing each action and thus build the group of signals of each video. The second is an Argentinian Sign Language data set (LSA64), from which only the signs performed with just the right hand have been used; in this case the joint signals have been obtained using MediaPipe. The results obtained with the presented method have been compared with those of a Long Short-Term Memory (LSTM) method, with promising results.
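To make the CSP step concrete, the following is a minimal two-class sketch in Python (NumPy and SciPy). It is only an illustration under stated assumptions, not the implementation used in this work: each video is assumed to be a matrix of shape (channels, samples) whose rows are joint-coordinate signals, and the function names, the trace normalization, the number of filter pairs, and the log-variance features are all hypothetical choices made for the example.

```python
import numpy as np
from scipy.linalg import eigh


def class_covariance(trials):
    """Mean trace-normalized spatial covariance of trials shaped (channels, samples)."""
    covs = []
    for x in trials:
        x = x - x.mean(axis=1, keepdims=True)  # remove each channel's mean
        c = x @ x.T
        covs.append(c / np.trace(c))
    return np.mean(covs, axis=0)


def fit_csp(trials_a, trials_b, n_pairs=2):
    """Return 2 * n_pairs spatial filters (as rows) for a two-class problem."""
    cov_a = class_covariance(trials_a)
    cov_b = class_covariance(trials_b)
    # Generalized eigenproblem: cov_a w = lambda (cov_a + cov_b) w.
    # eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = eigh(cov_a, cov_a + cov_b)
    n = len(eigvals)
    # Keep the filters at both ends of the spectrum: the first ones favor
    # class B variance, the last ones favor class A variance.
    idx = np.concatenate([np.arange(n_pairs), np.arange(n - n_pairs, n)])
    return eigvecs[:, idx].T


def log_variance_features(filters, trial):
    """Project one trial with the CSP filters and return normalized log-variances."""
    z = filters @ trial
    var = z.var(axis=1)
    return np.log(var / var.sum())


# Synthetic stand-in for joint signals (assumption: 8 channels, 200 frames).
rng = np.random.default_rng(0)
videos_a = rng.standard_normal((20, 8, 200))
videos_a[:, 0, :] *= 5.0  # class A moves mostly along channel 0
videos_b = rng.standard_normal((20, 8, 200))
videos_b[:, 1, :] *= 5.0  # class B moves mostly along channel 1

csp_filters = fit_csp(videos_a, videos_b, n_pairs=2)
features = log_variance_features(csp_filters, videos_a[0])
```

Here the filtered signals `filters @ trial` play the role of the transformed signals mentioned above; how they are arranged into a summary image and how more than two classes are handled are not covered by this sketch.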