Evaluation of Question Answering Systems: Complexity of judging a natural language