On the Relevance of Syntactic and Discourse Features for Author Profiling and Identification

Most approaches to author profiling and author identification focus mainly on lexical features, i.e., on the content of a text. We argue that syntactic and discourse features deserve a significantly more prominent role than they have been given in the past. We show that they achieve state-of-the-art performance in author and gender identification on a literary corpus while keeping the feature set small: it comprises only 188 features and still outperforms the winner of the PAN 2014 shared task on author verification in the literary genre.
