Remote sensing (RS) scene classification is challenging because scenes within a category vary in scale and orientation. Bilinear pooling can extract higher-order, spatially orderless information and has achieved impressive performance on various visual tasks. However, bilinear pooled features are high-dimensional, which makes them impractical for subsequent processing, especially for convolutional neural network (CNN) models with more channels in the final convolutional layer. To alleviate this shortcoming, this work proposes an improved bilinear pooling method for building a compact bilinear CNN model. Specifically, a joint pooling method is proposed to reduce the dimensionality of the bilinear features, and it can be embedded in a bilinear CNN architecture for end-to-end optimization. Experiments on three real RS scene image data sets show that the improved bilinear pooling method yields features with higher discriminative power than the original bilinear pooling method, but with lower dimensionality. It also reduces the running time of model training.
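
To make the dimensionality problem concrete, the following is a minimal NumPy sketch of standard bilinear pooling, not the joint pooling method proposed in this work. The function name `bilinear_pool`, the 512-channel 14 x 14 feature map, and the signed square-root and L2 normalization steps are illustrative assumptions that follow common practice rather than details taken from this paper.

```python
import numpy as np

def bilinear_pool(features: np.ndarray) -> np.ndarray:
    """Standard bilinear pooling of one feature map of shape (C, H, W).

    Hypothetical helper for illustration; not the paper's joint pooling.
    """
    c, h, w = features.shape
    x = features.reshape(c, h * w)          # one C-dim descriptor per location
    b = x @ x.T / (h * w)                   # averaged outer products -> (C, C)
    v = b.reshape(-1)                       # flatten to a C*C vector
    v = np.sign(v) * np.sqrt(np.abs(v))     # signed square root (assumed convention)
    return v / (np.linalg.norm(v) + 1e-12)  # L2 normalization (assumed convention)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((512, 14, 14))   # assumed final-layer feature map
desc = bilinear_pool(fmap)
print(desc.shape)  # (262144,)
```

Because the outer products are averaged over all spatial locations, the descriptor does not depend on the order of those locations, while its length grows quadratically with the channel count (512 channels already give 262,144 dimensions). This quadratic growth is the cost that a compact variant aims to reduce.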