Criterial feature extraction using parallel learner corpora and machine learning

This study reports on a new approach to semi-automatic error annotation and criterial feature extraction from learner corpora. Parallel learner corpora, sets of original learner writings paired with their proofread counterparts, were processed using edit distance to automatically identify surface taxonomy errors. These errors were then statistically analysed to identify language features that are criterial for a particular language proficiency level. Two case studies apply different statistical and machine learning techniques: a clustering technique called variability-based neighbour clustering and an ensemble learning method called random forest. The results of the two case studies show that using edit distance over parallel learner corpora is a promising direction for annotating a large quantity of learner data with minimal manual annotation work, and that both techniques were effective in identifying criterial features from learner corpora. Some theoretical and methodological issues are discussed for further research.
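As an illustration of the edit-distance step, the following minimal sketch aligns a learner sentence with its proofread counterpart at the token level and labels each difference. It is not the study's actual implementation: the function name `align_errors`, whitespace tokenisation, unit edit costs, and the mapping of edit operations to the labels "omission", "addition", and "misformation" are all assumptions made for this example.

```python
def align_errors(original, corrected):
    """Label token-level differences between a learner sentence and its
    proofread version using Levenshtein alignment (hypothetical helper)."""
    src = original.split()   # assumption: simple whitespace tokenisation
    tgt = corrected.split()
    n, m = len(src), len(tgt)

    # dist[i][j] = edit distance between src[:i] and tgt[:j], unit costs
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # learner token absent from correction
                dist[i][j - 1] + 1,         # corrected token absent from original
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Trace back through the table to recover the edit operations.
    errors = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and src[i - 1] == tgt[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1             # identical tokens, no error
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            errors.append(("misformation", src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            # The proofreader supplied a token the learner left out.
            errors.append(("omission", "", tgt[j - 1]))
            j -= 1
        else:
            # The proofreader removed a token the learner wrote.
            errors.append(("addition", src[i - 1], ""))
            i -= 1
    errors.reverse()
    return errors


if __name__ == "__main__":
    for error in align_errors("He go to school yesterday",
                              "He went to the school yesterday"):
        print(error)
```

Counts of such labelled errors per text, for example per part of speech or per word, could then be used as input features for the clustering or random forest analyses. That feature design is likewise an assumption and is not specified in this abstract.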