A method for dance motion recognition and scoring using a two-layer classifier based on a conditional random field and a stochastic error-correcting context-free grammar

This paper presents a unified framework for recognizing and scoring dance motion using a two-layer classifier, so that computational complexity is distributed across the two layers. The research examines the performance of a sliding window, a hidden Markov model (HMM), and a conditional random field (CRF) as the first-layer classifier, which segments the input video into a sequence of motion-primitive labels. The second-layer classifier is a stochastic error-correcting context-free grammar, built from dance-master knowledge, which parses the label sequence, builds a parse tree, and computes the accumulated dance score. The dataset for this research was captured with a single Kinect camera. The training dataset consists of 212 samples of 12 motion primitives and seven videos of Pendet dance performances. Under 5-fold cross-validation, the accuracies of the sliding window, HMM, and CRF are 0.63, 0.79, and 0.86, respectively. This result shows that the CRF achieves higher performance as a dance motion-primitive recognizer than the HMM proposed by [1]. The CRF model achieves an accuracy of 0.88 when the motion feature comprises the angular coordinates of all skeleton joints, as proposed by [2], but this increases to 0.93 when the feature is restricted to upper-body joint coordinates. A stochastic error-correcting context-free grammar is chosen as the dance choreography model. An experiment using synthetic label sequences with cost factor ci = 1 and injected label errors of up to 50 percent shows that the grammar can tolerate label-sequence errors of up to 25 percent. The experiment on Pendet dance performances shows an average dance score of 79.3. The low dance score is attributed to several factors: variation in dance skill, unstable repetition of basic gestures, the high cost incurred when deletions and substitutions of local errors are replaced by insertion operations, duration variation due to the absence of timing guidelines for body-part motions, and a training dataset too limited to capture all possible basic-gesture variations.
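The core idea of the second layer, scoring a performance by the minimum total cost of error-correction operations against a reference choreography, can be illustrated with a minimal sketch. This is not the paper's grammar-based parser: it replaces the stochastic context-free grammar with a plain edit-distance alignment between the recognized label sequence and a single reference sequence, using a uniform cost factor `ci = 1` for insertion, deletion, and substitution. The label names and the linear scoring rule are illustrative assumptions.

```python
def correction_cost(recognized, reference, ci=1):
    """Minimum total edit cost to turn `recognized` into `reference`
    (Wagner-Fischer dynamic programming, uniform cost factor ci)."""
    m, n = len(recognized), len(reference)
    # dp[i][j] = cost of aligning recognized[:i] with reference[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * ci          # delete extra recognized labels
    for j in range(1, n + 1):
        dp[0][j] = j * ci          # insert missing reference labels
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if recognized[i - 1] == reference[j - 1] else ci
            dp[i][j] = min(dp[i - 1][j] + ci,       # deletion
                           dp[i][j - 1] + ci,       # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[m][n]

def dance_score(recognized, reference, ci=1):
    """Illustrative score out of 100: full marks minus a penalty
    proportional to the correction cost per reference label."""
    cost = correction_cost(recognized, reference, ci)
    return max(0.0, 100.0 * (1 - cost / len(reference)))

# Hypothetical motion-primitive labels (placeholders, not the paper's label set)
reference  = ["agem", "ngeseh", "ulap", "agem", "ngeseh", "ulap", "agem", "seledet"]
recognized = ["agem", "ngeseh", "ulap", "agem", "ulap", "agem", "seledet"]  # one label dropped
print(dance_score(recognized, reference))  # one deletion out of 8 labels -> 87.5
```

Under this linear penalty, a 25 percent label-error rate still leaves a score of 75, which is consistent with the tolerance threshold reported in the abstract; the actual parser additionally exploits the grammar's production probabilities when choosing corrections.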