Generating Instructions in Virtual Environments (GIVE): A Challenge and an Evaluation Testbed for NLG

Would it be helpful or detrimental to the field of NLG to have a generally accepted competition? Competitions have clearly advanced the state of the art in some fields of NLP, but the benefits sometimes come at the price of over-competitiveness, and there is a danger of overfitting systems to the specific evaluation metrics. Moreover, it has been argued that there are intrinsic difficulties in NLG that make it harder to evaluate than other NLP tasks (Scott and Moore, 2006). We agree that NLG is too diverse for a single “competition”, and that there are no generally accepted evaluation metrics. Instead, we suggest that all of the positive aspects, and only a few of the negative ones, can be achieved by putting forth a challenge to the community. Research teams would implement systems that address various aspects of the challenge. These systems would then be evaluated regularly, and the results compared at a workshop. There would be no “winner” in the sense of a competition; rather, the focus would be on learning what works and what doesn’t, building upon the best ideas, and perhaps reusing the best modules for next year’s round. As a side effect, the exercise should result in a growing body of shareable tools and modules.