Abstract: Burrows' Delta Method (Burrows, 2002) is a leading method of authorship attribution. It can be used to shortlist candidate authors from a larger list, or even to identify the likely author. The technique has been extended by Hoover (2004a, 2006). In this investigation, we examine four factors that affect the accuracy of text classification: the choice of words for the word vector, the size of the word vector, the similarity measure, and the choice of corpus. Our results show that a word frequency vector of between 200 and 300 words gives the most accurate results (Aldridge, 2007). We also demonstrate a dramatic improvement in accuracy by adapting Burrows' Delta to the cosine similarity measure. Additionally, our results indicate areas where the word vector can be optimized further for more accurate results.
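To make the comparison of similarity measures concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual formulation of Burrows' Delta (the mean absolute difference between z-scored relative frequencies of the most frequent words) and, as one plausible reading of "adapting Delta to the cosine similarity measure", a cosine distance computed over the same z-score vectors. The function names, whitespace tokenization, use of the population standard deviation, and the default vector size of 250 words (taken from the 200 to 300 range reported above) are all illustrative assumptions.

```python
import math
from collections import Counter


def top_words(texts, n):
    """Return the n most frequent words across all texts (naive whitespace tokens)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(n)]


def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in one text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in vocab]


def column_stats(rows):
    """Per-word mean and population standard deviation across the candidate texts."""
    cols = list(zip(*rows))
    means = [sum(col) / len(col) for col in cols]
    # A zero standard deviation is replaced by 1.0 so the z-score stays defined.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(cols, means)
    ]
    return means, stds


def standardize(freqs, means, stds):
    """Convert relative frequencies to z-scores."""
    return [(f - m) / s for f, m, s in zip(freqs, means, stds)]


def burrows_delta(z_a, z_b):
    """Classic Delta: mean absolute difference of z-scores (lower = more similar)."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b)) / len(z_a)


def cosine_distance(z_a, z_b):
    """Assumed cosine variant: 1 - cosine similarity of the z-score vectors."""
    dot = sum(a * b for a, b in zip(z_a, z_b))
    norm_a = math.sqrt(sum(a * a for a in z_a))
    norm_b = math.sqrt(sum(b * b for b in z_b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as uncorrelated with everything
    return 1.0 - dot / (norm_a * norm_b)


def rank_candidates(disputed, candidates, n_words=250, distance=burrows_delta):
    """Rank candidate authors by distance to the disputed text, closest first."""
    vocab = top_words(candidates.values(), n_words)
    rows = {name: relative_freqs(text, vocab) for name, text in candidates.items()}
    means, stds = column_stats(list(rows.values()))
    z_disputed = standardize(relative_freqs(disputed, vocab), means, stds)
    scores = {
        name: distance(z_disputed, standardize(row, means, stds))
        for name, row in rows.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])
```

Calling `rank_candidates(disputed_text, {"Author A": text_a, "Author B": text_b})` returns the candidates ordered by classic Delta; passing `distance=cosine_distance` switches to the cosine variant without changing the word vector, which is the kind of like-for-like comparison the abstract describes.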
[1] Susan Brewer, et al. Information storage and retrieval. 1959, ACM '59.
[2] David L. Hoover, et al. Testing Burrows's Delta. 2004, Lit. Linguistic Comput.
[3] John Burrows, et al. 'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship. 2002, Lit. Linguistic Comput.
[4] F. Mosteller, et al. Inference and Disputed Authorship: The Federalist. 1966.
[5] David L. Hoover, et al. Delta Prime? 2004, Lit. Linguistic Comput.
[6] H. Gardner. The New Oxford Book of English Verse 1250–1950. 1972.
[7] Hichem Frigui, et al. Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents. 2004.