Measuring Correlation Between Linguists' Judgments and Latent Dirichlet Allocation Topics

Data annotated by linguists is often treated as a gold standard for many NLP tasks. However, linguists are expensive, so researchers seek automatic techniques that correlate well with human performance. Linguists working on the ScamSeek project were given the task of deciding how many document classes existed in a previously unseen corpus, and what those classes were. This paper investigates whether the document classes identified by the linguists correlate significantly with Latent Dirichlet Allocation (LDA) topics induced from the same corpus. Monte-Carlo simulation is used to measure the statistical significance of the correlation between the LDA models and the linguists' characterisations. In our experiments, more than 90% of the linguists' classes met the significance threshold, supporting the conclusion that the correlation between linguistic insights and LDA models is significant. These results help verify the usefulness of the LDA model in NLP and are a first step towards showing that the LDA model can replace the efforts of linguists on certain tasks, such as subdividing a corpus into classes.
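To make the testing procedure concrete, the sketch below shows one way such a Monte-Carlo (permutation) significance test could be implemented. It is an illustration under stated assumptions, not the paper's actual method: the association statistic (a purity-style score based on each class's dominant LDA topic), the toy data, and all function names are hypothetical, since the abstract does not specify the statistic used.

```python
import numpy as np

def class_topic_association(classes, topics, n_topics):
    """Purity-style association score (an assumed statistic, not the
    paper's): for each linguist class, count the documents that share
    the class's most common LDA topic, then normalise by corpus size."""
    score = 0.0
    for c in np.unique(classes):
        members = topics[classes == c]
        score += np.bincount(members, minlength=n_topics).max()
    return score / classes.size

def permutation_p_value(classes, topics, n_topics, n_sim=10_000, seed=0):
    """Monte-Carlo p-value: how often does a random reassignment of the
    class labels score at least as high as the observed labelling?"""
    rng = np.random.default_rng(seed)
    observed = class_topic_association(classes, topics, n_topics)
    hits = 0
    for _ in range(n_sim):
        shuffled = rng.permutation(classes)  # null hypothesis: no link
        if class_topic_association(shuffled, topics, n_topics) >= observed:
            hits += 1
    # Add-one smoothing gives a valid p-value for a finite simulation.
    return observed, (hits + 1) / (n_sim + 1)

# Toy usage: 200 documents, 4 linguist classes, 4 dominant LDA topics,
# with topics agreeing with classes about 70% of the time.
rng = np.random.default_rng(1)
classes = rng.integers(0, 4, size=200)
topics = np.where(rng.random(200) < 0.7, classes, rng.integers(0, 4, size=200))
obs, p = permutation_p_value(classes, topics, n_topics=4)
print(f"association = {obs:.3f}, p = {p:.4f}")
```

Shuffling the class labels while holding the topic assignments fixed simulates the null hypothesis that the linguists' classes are unrelated to the induced topics, so a small p-value licenses the claim of significant correlation.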