RUBIX: a framework for improving data integration with linked data

With today's public data sets containing billions of data items, more and more companies are looking to integrate external data with their traditional enterprise data to improve business intelligence analysis. These distributed data sources, however, exhibit heterogeneous data formats and terminologies and may contain noisy data. In this paper, we present RUBIX, a novel framework that enables business users to semi-automatically perform data integration on potentially noisy tabular data. The framework extends Google Refine with novel schema matching algorithms that leverage Freebase's rich types. Initial experiments show that using Linked Data to map cell values to instances and column headers to types significantly improves the quality of the matching results and should therefore lead to more informed decisions.
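
The idea of mapping cell values to instances and column headers to types can be illustrated with a minimal sketch. It is not the RUBIX algorithm: it assumes a simple majority vote over candidate types, and the `TOY_KB` dictionary, the type names, and the `infer_column_type` function are hypothetical stand-ins for a real Linked Data lookup such as a Freebase reconciliation call.

```python
from collections import Counter

# Hypothetical stand-in for a Linked Data lookup; entries are illustrative only.
TOY_KB = {
    "paris": ["location.city", "people.person"],
    "berlin": ["location.city"],
    "london": ["location.city"],
    "france": ["location.country"],
}


def candidate_types(cell):
    """Return the candidate types for a cell value, or an empty list if unknown."""
    return TOY_KB.get(cell.strip().lower(), [])


def infer_column_type(cells):
    """Pick the type supported by the most cells (assumed majority-vote heuristic).

    Returns (type, support), where support is the fraction of cells that
    carry the winning type, or (None, 0.0) if no cell could be matched.
    """
    votes = Counter()
    for cell in cells:
        # Count each type at most once per cell.
        for type_name in set(candidate_types(cell)):
            votes[type_name] += 1
    if not votes:
        return None, 0.0
    best, count = votes.most_common(1)[0]
    return best, count / len(cells)


# "Londn" is a noisy value that the toy lookup cannot match.
column = ["Paris", "Berlin", "Londn", "London"]
col_type, support = infer_column_type(column)
print(col_type, support)
```

Because the vote is taken over the whole column, a single noisy or ambiguous cell (here "Londn", or "Paris" as a person name) does not change the inferred type; it only lowers the support score.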
