Layer-wise enhanced transformer with multi-modal fusion for image caption