Analysing Utterances in Polish Parliament to Predict Speaker’s Background

Abstract. In this study we use transcripts of the Sejm (the Polish parliament) to predict a speaker’s background: gender, education, party affiliation and birth year. We create learning cases, each consisting of 100 utterances by the same author, and extract a variety of features from them using the rich multi-level annotations of the source corpus. The features are either text-based (e.g. mean sentence length, percentage of long words or frequency of named entities of certain types) or word-based (unigrams and bigrams of surface forms, lemmas and interpretations). Next, we apply general-purpose feature selection, regression and classification algorithms and obtain results well above the baseline (97% accuracy for gender, 95% for education, 76–88% for party). A comparative study shows that random forest and k-nearest-neighbours classifiers usually outperform other methods commonly used in text mining, such as support vector machines and the naïve Bayes classifier. The evaluation experiments help to understand how these methods deal with such sparse, high-dimensional data and which of the considered traits influence language the most. We also address difficulties caused by some properties of Polish that are also typical of other Slavonic languages.
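
To make the text-based features more concrete, the following minimal Python sketch computes two of them (mean sentence length and percentage of long words) for one learning case. It is an illustration only, not the authors' implementation: the function name `text_features`, the regex-based sentence splitting and tokenisation, and the 8-character threshold for a "long" word are all assumptions, since the abstract does not specify them (the study itself relies on the corpus annotations rather than regular expressions).

```python
import re

# Assumed threshold: the abstract does not say how long a "long word" is.
LONG_WORD_MIN_CHARS = 8


def text_features(utterances):
    """Compute two text-based features for one learning case.

    `utterances` is a list of strings by the same speaker (hypothetical
    input format; the study uses 100 utterances per learning case).
    """
    sentence_lengths = []
    words = []
    for utterance in utterances:
        # Naive sentence split: break after '.', '!' or '?' plus whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", utterance.strip()):
            # Naive tokenisation: runs of Unicode word characters.
            tokens = re.findall(r"\w+", sentence)
            if tokens:
                sentence_lengths.append(len(tokens))
                words.extend(tokens)
    if not words:
        return {"mean_sentence_length": 0.0, "long_word_pct": 0.0}
    long_words = sum(1 for w in words if len(w) >= LONG_WORD_MIN_CHARS)
    return {
        "mean_sentence_length": sum(sentence_lengths) / len(sentence_lengths),
        "long_word_pct": 100.0 * long_words / len(words),
    }


if __name__ == "__main__":
    case = ["Ala ma kota. Parlament obraduje dzisiaj."]
    print(text_features(case))
```

Feature vectors of this kind, one per learning case, could then be passed to any general-purpose classifier; the word-based n-gram features mentioned in the abstract would be added as further columns.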
