Speech emotion recognition (SER) is an active research topic. A key issue in improving the performance of SER systems is the choice of speech emotion features: to build a robust SER system, it is essential to select features that faithfully represent the emotional attributes of speech. Researchers have done a great deal of work, proposed a variety of emotional features, and made considerable progress. Although each kind of feature has been shown to be effective, most methods rely on a single feature type. In this paper, we propose a deep-learning-based feature fusion method that combines spectral features with pitch-based hyper-prosodic features. Experiments show that this method improves the performance of the speech emotion recognition system.
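The abstract does not specify the fusion architecture, but feature-level fusion of two streams can be sketched minimally as follows. This is an illustrative sketch only, not the paper's method: the feature dimensions, the concatenation strategy, and the tiny untrained network are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance features (names and dimensions are assumptions):
# a 39-dim spectral vector (e.g., averaged MFCC-style coefficients) and a
# 6-dim pitch-based prosodic vector (e.g., F0 mean/std/range statistics).
spectral_feats = rng.standard_normal(39)
prosodic_feats = rng.standard_normal(6)

# Feature-level fusion: concatenate the two streams into one input vector.
fused = np.concatenate([spectral_feats, prosodic_feats])  # shape (45,)

# Tiny one-hidden-layer classifier over the fused vector. Weights are random
# placeholders; a real system would train them on labeled emotional speech.
n_emotions = 4  # e.g., angry / happy / sad / neutral (an assumed label set)
W1 = rng.standard_normal((45, 16)) * 0.1
W2 = rng.standard_normal((16, n_emotions)) * 0.1

hidden = np.tanh(fused @ W1)
logits = hidden @ W2
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over the assumed emotion classes

print(fused.shape, probs.shape)
```

The design choice sketched here, concatenating streams before the classifier, is the simplest form of fusion; deep-learning approaches may instead fuse learned hidden representations of each stream.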