Cross-modal recognition of pictures and descriptions without test-appropriate encoding