Data annotated by linguists is often considered a gold standard for many tasks in NLP. However, linguists are expensive, so researchers seek automatic techniques that correlate well with human performance. Linguists working on the ScamSeek project were given the task of deciding how many and which document classes existed in a previously unseen corpus. This paper investigates whether the document classes identified by the linguists correlate significantly with Latent Dirichlet Allocation (LDA) topics induced from that corpus. Monte Carlo simulation is used to measure the statistical significance of the correlation between LDA models and the linguists' characterisations. In experiments, more than 90% of the linguists' classes met the level required to declare that the correlation between linguistic insights and LDA models is significant. These results help verify the usefulness of the LDA model in NLP and are a first step in showing that the LDA model can replace the efforts of linguists in certain tasks, such as subdividing a corpus into classes.
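To make the evaluation procedure concrete, the following is a minimal sketch of a simulation-based (Monte Carlo) p-value for the agreement between linguist-assigned document classes and the dominant LDA topic of each document. The agreement statistic (normalized mutual information), the function names, and the toy data are illustrative assumptions, not the paper's actual measure or corpus.

```python
# Sketch: Monte Carlo estimate of P(agreement >= observed) under a null
# where linguist class labels are randomly shuffled with respect to the
# per-document dominant LDA topics. NMI is used here only as a stand-in
# agreement statistic.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score


def monte_carlo_p_value(linguist_classes, lda_topics, n_sims=10_000, seed=0):
    """Estimate the significance of the observed class/topic agreement."""
    rng = np.random.default_rng(seed)
    observed = normalized_mutual_info_score(linguist_classes, lda_topics)

    exceed = 0
    for _ in range(n_sims):
        # Shuffling the linguist labels destroys any real association.
        shuffled = rng.permutation(linguist_classes)
        if normalized_mutual_info_score(shuffled, lda_topics) >= observed:
            exceed += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_sims + 1)


if __name__ == "__main__":
    # Toy example: 12 documents, 3 linguist classes, one dominant topic each.
    classes = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
    topics = np.array([3, 3, 3, 1, 1, 1, 1, 1, 0, 0, 0, 3])
    print("estimated p-value:", monte_carlo_p_value(classes, topics))
```

A small simulated p-value here would indicate that the observed alignment between the linguists' classes and the LDA topics is unlikely to arise by chance, which is the criterion the abstract describes.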