PolicyMiner: From Oysters to Pearls

Today, a historically unprecedented volume of data is available in the public domain, with the potential to become useful for researchers. More than ever before, political parties and governments are making data such as speeches, legislative bills and acts publicly available. However, as the volume of available data increases, so does the need for sophisticated tools for web harvesting and data analysis. Yet for the most part, the researchers who develop these tools come from a computer science background, while researchers in the social and behavioral sciences who have an interest in using such tools often lack the training needed to apply them. To bridge these two communities, we propose a new tool called PolicyMiner. Its objective is twofold: first, to provide a general-purpose web-harvesting and data clean-up tool that can be used with relative ease by researchers with limited technical backgrounds; second, to implement knowledge discovery algorithms that can be applied to textual data, such as legislative acts. This paper is a technical document detailing the data-processing steps implemented in PolicyMiner. First, PolicyMiner harvests the raw HTML data from publicly available websites, such as governmental sites, and provides a single integrated view of the data. Second, it cleans the data by removing irrelevant items, such as HTML tags and non-informative terms. Third, it classifies the harvested data according to a predefined standard conceptual hierarchy based on the Eurovoc thesaurus. Fourth, it applies different knowledge discovery algorithms, such as time series and correlation-based analysis, to capture the temporal and substantive policy dependencies of the textual data across countries.
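The cleaning, classification and correlation steps described above can be illustrated with a minimal Python sketch. It is not PolicyMiner's implementation: the stop-word list, the two concept labels with their keywords, and all function names are hypothetical stand-ins (a real system would draw concepts from the Eurovoc thesaurus), and harvesting is replaced by an inline HTML string. Only the Python standard library is used.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stop words and concept keywords, chosen for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on", "is", "shall", "be"}
CONCEPT_KEYWORDS = {
    "environment": {"emissions", "pollution", "climate"},
    "finance": {"tax", "budget", "revenue"},
}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(raw_html):
    """Cleaning step, part 1: drop HTML tags and keep the visible text."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Cleaning step, part 2: lowercase, split into words, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def classify(tokens):
    """Classification step: pick the concept with the most keyword hits."""
    counts = Counter(tokens)
    scores = {c: sum(counts[k] for k in kws) for c, kws in CONCEPT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def pearson(xs, ys):
    """Correlation step: Pearson correlation of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


RAW = ("<html><body><h1>Act on Emissions</h1>"
       "<p>The tax on pollution and emissions shall apply.</p>"
       "<script>var x = 1;</script></body></html>")

tokens = tokenize(strip_html(RAW))
label = classify(tokens)
```

In this sketch, `pearson` would be applied to per-year counts of documents assigned to one concept in two countries, which is one simple reading of "correlation-based analysis"; keyword counting likewise stands in for whatever classifier the tool actually uses.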
