Paper information - The Linguistic Status of Predictions and Feature Ranks from SVM Text Classifiers

Text classification systems can predict certain characteristics of a text’s author (e.g., gender and age) using only linguistic properties. This paper asks why such predictions are possible and how they can be interpreted. Three factors are considered: (1) the nature of the features used by the system; (2) the robustness of the predictions across time and genres; (3) the amount of data required for training and testing. Some classification predictions (e.g., gender) are based on non-content linguistic material that generalizes across time and genre. These classifications are characterized by stable performance and feature ranks, and permit linguistic interpretation.

Jonathan Dunn
