Automatic thematic classification of election manifestos

We digitized Dutch election manifestos from three election years, annotated by the Dutch political scientist Isaac Lipschits, and used these data to train a classifier that automatically labels new, unseen election manifestos with themes. Having the manifestos in a uniform XML format, with every paragraph annotated with its themes, benefits both electronic publishing of the data and diachronic comparative analysis. The data will be made publicly available through a search interface in which they can be queried and filtered by theme and party. We optimized the Lipschits classifier for the task of classifying election manifestos with models trained on earlier years, and built a classifier suited to manifestos from 2002 onwards using the data from the 1980s and 1990s. We evaluated the results by having a domain expert manually assess a sample of the classified data. On unseen data, our automatic classifier obtains the same precision as a human classifier. Its recall could be improved by extending the set of themes with newly emerged themes. Thus, when old political texts are used to classify new texts, additional work is needed to link the existing set of themes to newer topics and to expand it where needed.
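The relation between precision, recall, and newly emerged themes can be illustrated with a small sketch. It is not the evaluation procedure used in this work: the theme names, the toy paragraphs, and the helper functions micro_precision_recall and recall_ceiling are all hypothetical, and micro-averaging over paragraph-level theme sets is an assumption made for the example. The point it shows is that a classifier trained only on older themes can be fully precise while its recall is capped by themes it has never seen.

```python
from typing import List, Set, Tuple


def micro_precision_recall(
    gold: List[Set[str]], predicted: List[Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-paragraph theme sets."""
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


def recall_ceiling(gold: List[Set[str]], known_themes: Set[str]) -> float:
    """Best recall reachable when only themes from the training years can be assigned."""
    reachable = sum(len(g & known_themes) for g in gold)
    n_gold = sum(len(g) for g in gold)
    return reachable / n_gold if n_gold else 0.0


# Hypothetical theme set available in the older training data.
known_themes = {"economy", "education", "defence"}

# Hypothetical expert labels for four paragraphs of a newer manifesto;
# "integration" stands in for a theme that emerged after the training years.
gold = [{"economy"}, {"education", "integration"}, {"integration"}, {"defence"}]

# Hypothetical classifier output, restricted to the known themes.
predicted = [{"economy"}, {"education"}, set(), {"defence"}]

precision, recall = micro_precision_recall(gold, predicted)
ceiling = recall_ceiling(gold, known_themes)
print(f"precision={precision:.2f} recall={recall:.2f} recall ceiling={ceiling:.2f}")
```

In this toy setting every assigned theme is correct, yet two of the five expert labels belong to a theme absent from the training data, so recall cannot rise above the ceiling until the theme set is extended.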