Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation