Integrating open government data with Stratosphere for more transparency

Governments are increasingly publishing their data so that organizations and citizens can browse and analyze it. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integration, and thus limits the desired transparency. In this article, we present the newly developed data integration operators of the Stratosphere parallel data analysis framework, which are designed to overcome this heterogeneity. Using declaratively specified queries, we demonstrate the integration of well-known government data sources and other large open data sets at the technical, structural, and semantic levels. Furthermore, we publish the integrated data on the Web in a form that enables users to discover relationships between persons, government agencies, funds, and companies. The evaluation shows that linking person entities across different data sets achieves a precision of 98.3% and a recall of 95.2%. Moreover, the integration of large data sets scales well on up to eight machines.
