What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions