Our work on author identification and author profiling is based on the question: can the number and types of grammatical errors serve as indicators for a specific author or a group of people? To detect grammatical errors, we base our approach on the output of the open-source library LanguageTool. For author identification, we transform the problem into a statistical test, in which an unknown document is attributed to another author when its distribution of grammatical errors deviates from that of the documents in a reference corpus. For author profiling, we implemented an instance-based classification approach, namely a k-NN classifier, in combination with a language model; a text is assigned to the age or gender group whose reference corpus contains the closest match. In the evaluation, we found that in both scenarios grammatical errors perform better than the baseline and capture an aspect of writing style that is not covered by more traditional features such as stylometric features or word n-grams.
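The idea of treating author identification as a statistical test can be illustrated with a minimal sketch. The abstract does not say which test is used, so the Pearson chi-square statistic below is an assumption chosen for illustration, not the method of the paper. The category labels (`GRAMMAR`, `PUNCTUATION`, `TYPOS`), the function names, and the smoothing constant are likewise hypothetical. The sketch does not call LanguageTool; it assumes the error categories have already been extracted into plain lists of labels.

```python
from collections import Counter

# Hypothetical error categories; real LanguageTool output may use other labels.
CATEGORIES = ["GRAMMAR", "PUNCTUATION", "TYPOS"]

# Chi-square critical value for 2 degrees of freedom at the 5% level
# (degrees of freedom = number of categories - 1).
CRITICAL_VALUE_DF2 = 5.991


def smoothed_probs(labels, categories, alpha=1.0):
    """Estimate category probabilities from reference labels.

    Additive smoothing (alpha) keeps every expected probability above zero.
    """
    counts = Counter(labels)
    total = sum(counts[c] for c in categories) + alpha * len(categories)
    return [(counts[c] + alpha) / total for c in categories]


def chi_square_statistic(labels, expected_probs, categories):
    """Pearson chi-square statistic of observed error counts vs. expectation."""
    counts = Counter(labels)
    observed = [counts[c] for c in categories]
    n = sum(observed)
    return sum(
        (obs - n * p) ** 2 / (n * p)
        for obs, p in zip(observed, expected_probs)
        if n * p > 0
    )


def deviates_from_reference(unknown_labels, reference_labels,
                            categories=CATEGORIES,
                            critical_value=CRITICAL_VALUE_DF2):
    """Return True when the unknown document's error distribution deviates
    from the reference corpus, i.e. the statistic exceeds the critical value."""
    probs = smoothed_probs(reference_labels, categories)
    stat = chi_square_statistic(unknown_labels, probs, categories)
    return stat > critical_value


# Usage: errors pooled from the reference documents of a known author.
reference = ["GRAMMAR"] * 6 + ["PUNCTUATION"] * 3 + ["TYPOS"] * 1
similar_doc = ["GRAMMAR"] * 6 + ["PUNCTUATION"] * 3 + ["TYPOS"] * 1
different_doc = ["TYPOS"] * 10

print(deviates_from_reference(similar_doc, reference))
print(deviates_from_reference(different_doc, reference))
```

With so few errors per document, a chi-square approximation is rough; the sketch only shows the shape of the decision rule, in which a large deviation from the reference distribution is read as evidence for a different author.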