Focused retrieval and result aggregation with political data

This paper presents a case study in which we use a large semi-structured data set, consisting of official transcripts of meetings of the Dutch parliament, for focused retrieval and result aggregation. Transcripts of meetings are a document genre characterized by a complex narrative structure: what matters is not only what is said, but also by whom and to whom. We have the notes of more than 40 years of Dutch parliamentary debates, in which this structure is exploited to make semantic annotations automatically. These annotations yield numerous new ways of searching, browsing, mining, and summarizing these documents. Concerning result aggregation, we summarize and visualize the structure of meetings as tables of contents and interruption graphs. The contents of meetings, or of parts of meetings, are condensed into word clouds that are created using a parsimonious language model. Furthermore, we have developed a search engine that exploits the structure and annotations of our data, making it possible to provide entry points, to group search results, and to use faceted search techniques for data exploration. Evaluation shows that our content and structure summarization tools provide a good first impression of a debate. Users reported that, compared to a standard document retrieval system, our search engine gives a better overview of the data. Search tasks were performed faster, and users felt more certain of their answers.
