Training of the singing voice: a multimodal feature extraction approach