TRIP: An Interactive Retrieving-Inferring Data Imputation Approach

Data imputation aims to fill in missing attribute values in databases. Most existing imputation methods for string attribute values are inferring-based approaches, which usually fail to reach a high imputation recall because they infer missing values only from the complete part of the data set. Recently, retrieving-based methods have been proposed that retrieve missing values from external resources such as the World Wide Web. These methods tend to reach a much higher imputation recall, but they inevitably incur a large overhead by issuing a large number of search queries. In this paper, we investigate the interaction between inferring-based and retrieving-based methods. We show that retrieving a small number of selected missing values can greatly improve the imputation recall of inferring-based methods. Based on this intuition, we propose an inTeractive Retrieving-Inferring data imPutation approach (TRIP), which alternates between retrieving and inferring when filling in missing attribute values in a data set. To reach high recall at minimum cost, TRIP must select the smallest number of missing values to retrieve such that the number of inferable values is maximized. Our proposed solution identifies an optimal retrieving-inferring scheduling scheme in deterministic data imputation, and we prove the optimality of the generated scheme. We also show by example that an optimal scheme cannot be achieved in $\tau$-constrained stochastic data imputation ($\tau$-SDI); even so, our proposed solution identifies an expected-optimal scheme in $\tau$-SDI. Extensive experiments on four data collections show that TRIP retrieves on average 20 percent of the missing values while achieving the same high recall as the retrieving-based approach.
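The alternation between retrieving and inferring can be illustrated with a small sketch. It is not the scheduling algorithm of the paper: it assumes a toy deterministic setting in which each missing cell can be inferred once one of its premise sets is fully known, and it uses a simple greedy heuristic (retrieve the cell that unlocks the most inferences) rather than the provably optimal scheme. The names `infer_closure`, `schedule`, and `rules`, as well as the cell labels, are hypothetical.

```python
from typing import Dict, List, Set

Rules = Dict[str, List[Set[str]]]  # cell -> alternative premise sets (assumed format)


def infer_closure(known: Set[str], rules: Rules) -> Set[str]:
    """Return every cell derivable from `known` by applying the rules to a fixpoint."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for cell, premises in rules.items():
            # A cell becomes inferable once any one of its premise sets is fully known.
            if cell not in known and any(p <= known for p in premises):
                known.add(cell)
                changed = True
    return known


def schedule(missing: Set[str], known: Set[str], rules: Rules) -> List[str]:
    """Greedy heuristic (an assumption, not the paper's optimal scheme):
    alternate inferring and retrieving until no missing cell remains."""
    known = infer_closure(known, rules)  # inferring step
    retrieved: List[str] = []
    while missing - known:
        candidates = sorted(missing - known)
        # Retrieving step: pick the cell whose retrieval makes the most cells known.
        best = max(candidates, key=lambda c: len(infer_closure(known | {c}, rules)))
        retrieved.append(best)
        known = infer_closure(known | {best}, rules)  # inferring step
    return retrieved


# Toy data: b follows from a, c from b, and d from c together with the known cell k.
rules: Rules = {"b": [{"a"}], "c": [{"b"}], "d": [{"c", "k"}]}
plan = schedule({"a", "b", "c", "d"}, {"k"}, rules)
print(plan)  # cells that must be retrieved; the rest are inferred
```

In this toy instance, retrieving a single cell is enough for the remaining three missing cells to be inferred, which mirrors the observation that a few well-chosen retrievals can unlock many inferences.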
