Recently, convolutional neural network (CNN)-based hyperspectral image (HSI) classification has become popular owing to its strong performance. However, using 2-D or 3-D convolution alone may be suboptimal in real applications. On the one hand, 2-D convolution overlooks spectral information when extracting feature maps. On the other hand, 3-D convolution is computationally expensive in practice and seems to perform poorly in scenes with similar textures along consecutive spectral bands. To solve these problems, we propose a mixed CNN with covariance pooling for HSI classification. Specifically, our network architecture starts with spectral-spatial 3-D convolutions, which are followed by a spatial 2-D convolution. Through this mixture operation, we fuse the feature maps generated by the 3-D convolutions along the spectral bands to provide complementary information and reduce the channel dimension. In addition, covariance pooling is adopted to fully extract second-order information from the spectral-spatial feature maps. Motivated by the channel-wise attention mechanism, we further propose two strategies based on principal component analysis (PCA), channel-wise shift and channel-wise weighting, to highlight the importance of different spectral bands and recalibrate the channel-wise feature responses. These strategies effectively improve classification accuracy and stability, especially when the sample size is limited. To verify the effectiveness of the proposed model, we conduct classification experiments on three well-known HSI data sets: Indian Pines, University of Pavia, and Salinas Scene. The experimental results show that our proposal, despite having fewer parameters, achieves better accuracy than other state-of-the-art methods.
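As a rough illustration of the two operations named above, the following NumPy sketch folds the spectral axis of a 3-D convolution output into the channel axis and then applies covariance pooling. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names `fuse_spectral_axis` and `covariance_pooling`, the tensor shapes, the use of a plain reshape for the fusion step, and the 1/N normalization are all assumptions made for illustration, and the convolution layers themselves are omitted.

```python
import numpy as np


def fuse_spectral_axis(feat_3d):
    """Fold the spectral axis into channels (assumed fusion step).

    feat_3d has shape (filters, bands, height, width); the result has shape
    (filters * bands, height, width) and can be fed to a 2-D convolution.
    """
    f, d, h, w = feat_3d.shape
    return feat_3d.reshape(f * d, h, w)


def covariance_pooling(feat_2d):
    """Second-order pooling: channel-by-channel covariance over spatial positions.

    feat_2d has shape (channels, height, width); the result is a symmetric
    (channels, channels) matrix.
    """
    c, h, w = feat_2d.shape
    x = feat_2d.reshape(c, h * w)           # one row per channel
    x = x - x.mean(axis=1, keepdims=True)   # center each channel
    return (x @ x.T) / (h * w)              # assumed 1/N normalization


# Hypothetical sizes: 4 filters, 5 spectral positions, a 7 x 7 spatial patch.
rng = np.random.default_rng(0)
feat_3d = rng.standard_normal((4, 5, 7, 7))

fused = fuse_spectral_axis(feat_3d)   # stands in for the input of the 2-D convolution
cov = covariance_pooling(fused)       # stands in for the pooled second-order descriptor

print(fused.shape, cov.shape)  # (20, 7, 7) (20, 20)
```

In a full model, a spatial 2-D convolution would sit between the two calls, so the covariance matrix would be computed on its output rather than directly on the reshaped tensor. Because the pooled matrix is symmetric, only its upper triangle carries distinct values.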