Learning Bidirectional Action-Language Translation with Limited Supervision and Incongruent Extra Input