Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?