IOTA: Interlinking of heterogeneous multilingual open fiscal DaTA

Abstract. Open budget data are among the most frequently published datasets in the open data ecosystem, released to improve public administration and government transparency. Unfortunately, the prospects for analysis across different open budget datasets remain limited because the datasets differ in schema and language. Budget and spending datasets are published together with descriptive classifications, and public administrations typically publish these classifications and their concepts in their regional languages. The classifications can be exploited for more in-depth analysis, such as comparing similar items across datasets in different languages. Such analysis, however, requires a mapping across the multilingual classifications of the datasets. In this paper, we present the framework for Interlinking of Heterogeneous Multilingual Open Fiscal DaTA (IOTA). IOTA uses machine translation followed by string similarity measures to map concepts across different datasets. To the best of our knowledge, IOTA is the first framework to offer a scalable implementation of string similarity using distributed computing. The results demonstrate the applicability of the proposed multilingual matching and the scalability of the proposed framework, and provide an in-depth comparison of string similarity measures.
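The translate-then-match idea described in the abstract can be illustrated with a minimal single-machine sketch. It is not the IOTA implementation: the abstract does not specify the translation system, the exact similarity measures, or the distributed computing setup, so everything here is an assumption. In particular, `toy_translate` is a hypothetical dictionary lookup standing in for machine translation, `match_labels` and its `threshold` parameter are illustrative names, and the two measures shown (normalized Levenshtein and trigram Jaccard) are only examples of string similarities.

```python
from typing import Callable, Dict, List, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Map edit distance to a similarity in [0, 1]; case and outer whitespace are ignored."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigrams(text: str) -> Set[str]:
    """Character trigrams of a lowercased string, padded so short labels still yield grams."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over character trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


# Hypothetical stand-in for a machine translation step (assumption, not from the paper).
TOY_TRANSLATIONS: Dict[str, str] = {"Bildung": "Education", "Gesundheit": "Health"}


def toy_translate(label: str) -> str:
    return TOY_TRANSLATIONS.get(label, label)


def match_labels(
    source: List[str],
    target: List[str],
    translate: Callable[[str], str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.4,
) -> List[Tuple[str, str, float]]:
    """Translate each source label, then keep its best-scoring target at or above threshold."""
    matches = []
    for label in source:
        translated = translate(label)
        best = max(target, key=lambda t: similarity(translated, t))
        score = similarity(translated, best)
        if score >= threshold:
            matches.append((label, best, score))
    return matches


if __name__ == "__main__":
    german = ["Bildung", "Gesundheit"]
    english = ["Health care", "Education", "Defence"]
    for measure in (normalized_levenshtein, jaccard):
        print(measure.__name__, match_labels(german, english, toy_translate, measure))
```

The pairwise comparison in `match_labels` is quadratic in the number of labels, which is the step a distributed implementation would need to parallelize; this sketch makes no attempt to do so.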
