This paper proposes a novel Visual-Semantic Transformer (VST) that detects face forgery based on semantic-aware feature relations. In face images, intrinsic feature relations exist between different semantic parsing regions, and we find that face forgery algorithms always alter these relations. We therefore begin by extracting a Contextual Feature Sequence (CFS) with a transformer encoder to best capture abnormal feature relation patterns. In parallel, a face parsing module segments each image into soft face regions. We then merge the CFS and the soft face regions into Visual Semantic Sequences (VSS), which represent the features of the semantic regions. The VSS is fed into a transformer decoder, which models relations at the semantic region level. Our method achieves 99.58% accuracy on FF++ (Raw) and 96.16% accuracy on Celeb-DF. Extensive experiments demonstrate that our framework outperforms or is comparable to state-of-the-art detection methods, especially on unseen forgery methods.
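To make the merge step concrete, the following is a minimal sketch of one plausible way to combine a contextual feature sequence with soft face regions. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the merge operator, so mask-weighted average pooling is assumed here, and the function names `merge_cfs_with_regions` and `region_relations` are hypothetical. Cosine similarity stands in for the region-level relation modeling that a transformer decoder would learn.

```python
import math


def merge_cfs_with_regions(cfs, region_probs, eps=1e-8):
    """Pool contextual tokens into one feature vector per soft region.

    cfs: list of N tokens, each a list of D floats (assumed CFS layout).
    region_probs: list of K rows, each holding N soft membership weights.
    Returns K vectors of length D (an assumed form of the VSS).
    """
    n = len(cfs)
    if n == 0 or any(len(row) != n for row in region_probs):
        raise ValueError("each region needs exactly one weight per token")
    dim = len(cfs[0])
    vss = []
    for weights in region_probs:
        total = sum(weights) + eps  # eps avoids division by zero for empty regions
        vss.append([
            sum(w * token[d] for w, token in zip(weights, cfs)) / total
            for d in range(dim)
        ])
    return vss


def region_relations(vss, eps=1e-8):
    """Pairwise cosine similarity between region features (a K x K matrix)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b + eps)

    return [[cosine(a, b) for b in vss] for a in vss]


# Toy input: three tokens with two feature channels, two face regions.
cfs = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
region_probs = [[1.0, 1.0, 0.0],   # region 0 covers tokens 0 and 1
                [0.0, 0.0, 1.0]]   # region 1 covers token 2
vss = merge_cfs_with_regions(cfs, region_probs)
relations = region_relations(vss)
```

In this toy case each region vector is the average of the tokens it covers, and the two regions have orthogonal features, so their off-diagonal similarity is zero. A learned decoder would replace the fixed cosine measure with attention over the region sequence.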