Towards a Hybrid Imputation Approach Using Web Tables

Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With new emerging Big Data trends such as the data lake concept, which provides a low cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is additionally reinforced. While traditionally the process of filling in missing values is addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to lookup missing values. In this paper we propose a novel hybrid data imputation strategy that, takes into account the characteristics of an incomplete dataset and based on that chooses the best imputation approach, i.e. either a statistical approach such as regression analysis or a Web-based lookup or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system and evaluate them extensively using a corpus with 125M Web tables. We show that applying statistical techniques in conjunction with external data sources will lead to a imputation system which is robust, accurate, and has high coverage at the same time.

[1]  Surajit Chaudhuri,et al.  InfoGather: entity augmentation and attribute discovery by holistic matching with web tables , 2012, SIGMOD Conference.

[2]  Chao Liu,et al.  FACTO: a fact lookup engine based on web tables , 2011, WWW.

[3]  Alon Y. Halevy,et al.  Data Integration for the Relational Web , 2009, Proc. VLDB Endow..

[4]  Joseph M. Hellerstein,et al.  MAD Skills: New Analysis Practices for Big Data , 2009, Proc. VLDB Endow..

[5]  Meihui Zhang,et al.  InfoGather+: semantic matching and annotation of numeric and time-varying attributes in web tables , 2013, SIGMOD '13.

[6]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .

[7]  D. Rubin,et al.  Small-sample degrees of freedom with multiple imputation , 1999 .

[8]  Marta Indulska,et al.  WebPut: Efficient Web-Based Data Imputation , 2012, WISE.

[9]  Donald P. Ballou,et al.  Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems , 1985 .

[10]  Leonardo Franco,et al.  Missing data imputation using statistical and machine learning methods in a real breast cancer problem , 2010, Artif. Intell. Medicine.

[11]  Mark A. Hall,et al.  Correlation-based Feature Selection for Machine Learning , 2003 .

[12]  Wolfgang Lehner,et al.  Top-k entity augmentation using consistent set covering , 2015, SSDBM.

[13]  Jerzy W. Grzymala-Busse,et al.  A Comparison of Several Approaches to Missing Attribute Values in Data Mining , 2000, Rough Sets and Current Trends in Computing.

[14]  Robert P. Goldman,et al.  Imputation of Missing Data Using Machine Learning Techniques , 1996, KDD.

[15]  Wolfgang Lehner,et al.  DrillBeyond: Enabling Business Analysts to Explore the Web of Open Data , 2012, Proc. VLDB Endow..

[16]  Wolfgang Lehner,et al.  DeExcelerator: a framework for extracting relational data from partially structured documents , 2013, CIKM.

[17]  Sunita Sarawagi,et al.  Open-domain quantity queries on web tables: annotation, response, and consensus models , 2014, KDD.

[18]  Eric Crestan,et al.  Web-scale table census and classification , 2011, WSDM '11.