ST$^{2}$: Spatial-Temporal State Transformer for Crowd-Aware Autonomous Navigation