Identifying Current Issues in Short Answer Grading

Given a query answer (e.g. "The entire program.", "main() function."), the task of Short Answer Grading (SAG) is to assign a score (e.g. 5 or 0) reflecting the answer's correctness with respect to a reference answer. SAG is expected to be useful in many real-world applications, such as the automated assessment of student answers in examinations. In recent years, a number of datasets have been released, such as SciEntsBank [3] and X-CSD [5], which has led to the development of a number of computational models for SAG [6, 1]. However, the performance of SAG systems is still limited, which hampers their application in real-world settings. For example, a state-of-the-art system for SciEntsBank achieved a weighted F1 score of only 0.643 in 5-way scoring [7]. Furthermore, the literature has not yet explored what issues remain in building a better SAG system.

This paper aims to make these issues explicit. To this end, we create a simple SAG system that is easy to analyze yet comparable to state-of-the-art systems. We employ a simple k-Nearest Neighbors (kNN)-based system, in which the instances, namely answers, are represented by additive word vectors. Our experiments show that the kNN-based system achieves reasonable performance compared to state-of-the-art approaches. In addition, our detailed analysis of the system's behavior highlights some remaining issues in SAG.
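The kNN scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy word vectors, vocabulary, and labeled answers below are hypothetical stand-ins for the pretrained embeddings and training data a real SAG system would use.

```python
import numpy as np

# Hypothetical toy word vectors; a real system would load pretrained embeddings.
WORD_VECS = {
    "entire":   np.array([1.0, 0.0]),
    "whole":    np.array([0.9, 0.0]),
    "program":  np.array([1.0, 0.1]),
    "main":     np.array([0.0, 1.0]),
    "function": np.array([0.1, 1.0]),
}

def answer_vector(answer, vecs):
    """Represent an answer as the sum of its word vectors (additive composition)."""
    dim = len(next(iter(vecs.values())))
    v = np.zeros(dim)
    for token in answer.lower().split():
        if token in vecs:  # out-of-vocabulary tokens are simply skipped
            v += vecs[token]
    return v

def cosine(a, b):
    """Cosine similarity, with zero vectors mapped to similarity 0."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return float(a @ b / (na * nb))

def knn_grade(query, labeled, vecs, k=3):
    """Score a query answer by majority vote over its k nearest labeled answers."""
    qv = answer_vector(query, vecs)
    ranked = sorted(labeled,
                    key=lambda pair: cosine(qv, answer_vector(pair[0], vecs)),
                    reverse=True)
    top_scores = [score for _, score in ranked[:k]]
    return max(set(top_scores), key=top_scores.count)

# Illustrative labeled answers (score 5 = correct, 0 = incorrect).
labeled = [("the whole program", 5), ("entire program", 5), ("main function", 0)]
print(knn_grade("the entire program", labeled, WORD_VECS, k=3))  # → 5
```

Because the instance representation is just a sum of word vectors, the system's predictions can be traced directly back to the nearest training answers, which is what makes it easy to analyze.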