Graph Convolutional Networks have been successfully applied to skeleton-based action recognition, where the key challenge is to fully exploit the spatial-temporal context. This letter proposes a Focusing-Diffusion Graph Convolutional Network (FDGCN) to address this challenge. Each skeleton frame is first decomposed into two graphs with opposite edge directions for the subsequent focusing and diffusion processes. Next, the focusing process generates a spatial-level representation for each frame individually through an attention module. This representation serves as a supernode that aggregates the features of all joint nodes in the frame to extract spatial context. After supernodes are generated for the entire sequence, a transformer encoder layer further captures the temporal context across them. Finally, in the diffusion process, these supernodes pass the embedded spatial-temporal context back to the joint nodes through the diffusion graph. Extensive experiments on the NTU RGB+D and Skeleton-Kinetics benchmarks demonstrate the effectiveness of our approach.
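
The focusing-diffusion mechanism can be illustrated with a minimal PyTorch-style sketch. The module names, tensor shapes, the simple attention pooling, and the additive way context is diffused back to the joints are assumptions made for illustration; the sketch also omits the opposite-direction graph decomposition and the graph convolution layers, so it is not the authors' exact implementation.

```python
# A minimal sketch of the focusing-diffusion idea, assuming per-frame joint
# features of shape (batch, frames, joints, channels). Layer sizes and the
# pooling/diffusion choices below are illustrative assumptions.
import torch
import torch.nn as nn


class FocusingDiffusionSketch(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Focusing: attention scores that pool joint features into one supernode per frame.
        self.focus_score = nn.Linear(channels, 1)
        # Temporal context: a standard transformer encoder layer over the supernode sequence.
        self.temporal = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        # Diffusion: project the spatial-temporal context before passing it back to each joint.
        self.diffuse = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, channels)
        b, t, v, c = x.shape

        # Focusing process: attention-weighted aggregation of joints -> one supernode per frame.
        attn = torch.softmax(self.focus_score(x), dim=2)   # (b, t, v, 1)
        supernodes = (attn * x).sum(dim=2)                 # (b, t, c)

        # Temporal context across the supernodes of the whole sequence.
        supernodes = self.temporal(supernodes)             # (b, t, c)

        # Diffusion process: pass the embedded context back to every joint in its frame.
        context = self.diffuse(supernodes).unsqueeze(2)    # (b, t, 1, c)
        return x + context                                 # (b, t, v, c)


if __name__ == "__main__":
    # Example: NTU RGB+D-style skeletons (25 joints) over a 64-frame clip.
    feats = torch.randn(2, 64, 25, 128)
    out = FocusingDiffusionSketch(channels=128)(feats)
    print(out.shape)  # torch.Size([2, 64, 25, 128])
```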