A web-based approach to data imputation

In this paper, we present WebPut, a prototype system that takes a novel web-based approach to the data imputation problem. To this end, WebPut combines the information already available in an incomplete database with the data consistency principle. WebPut also extends effective Information Extraction (IE) methods to formulate web search queries that retrieve missing values with high accuracy. A confidence-based scheme draws on this suite of imputation queries to automatically select the most effective query for each missing value. We propose a greedy iterative algorithm that schedules the order in which the missing values in a database are imputed, and hence the order in which their corresponding imputation queries are issued, to improve both the accuracy and the efficiency of WebPut. We further propose several optimization techniques that reduce the cost of estimating the confidence of imputation queries at both the tuple level and the database level. Experiments on several real-world data collections demonstrate both the effectiveness of WebPut compared with existing approaches and the efficiency of the proposed algorithms and optimization techniques.
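The general idea of confidence-ordered greedy scheduling can be illustrated with a minimal sketch. This is not WebPut's actual algorithm, whose details the abstract does not give; it only assumes that some estimator returns a confidence and a candidate value for each missing cell, and that filling one cell can raise the confidence of others in the same tuple. The names `greedy_impute`, `toy_estimate`, and `FAKE_WEB` are hypothetical, and the lookup table stands in for issuing real web search queries.

```python
from typing import Callable, Dict, List, Optional, Tuple

Cell = Tuple[int, str]            # (row index, attribute name)
Row = Dict[str, Optional[str]]    # None marks a missing value
Estimate = Tuple[float, Optional[str]]  # (confidence, candidate value)


def greedy_impute(
    rows: List[Row],
    estimate: Callable[[List[Row], Cell], Estimate],
    min_confidence: float = 0.5,
) -> List[Cell]:
    """Fill missing cells in place, highest estimated confidence first.

    Returns the cells in the order they were filled. Stops as soon as the
    best remaining candidate falls below min_confidence or has no value.
    """
    missing = {
        (i, attr)
        for i, row in enumerate(rows)
        for attr, value in row.items()
        if value is None
    }
    order: List[Cell] = []
    while missing:
        # Re-score every remaining cell: earlier fills may change confidence.
        scored = [(estimate(rows, cell), cell) for cell in sorted(missing)]
        (conf, value), cell = max(scored, key=lambda s: s[0][0])
        if value is None or conf < min_confidence:
            break
        i, attr = cell
        rows[i][attr] = value
        missing.remove(cell)
        order.append(cell)
    return order


# Hypothetical stand-in for web search results (an assumption, not real data):
# (known attribute, known value, target attribute) -> retrieved value.
FAKE_WEB = {
    ("name", "Alice", "city"): "Paris",
    ("city", "Paris", "country"): "France",
}


def toy_estimate(rows: List[Row], cell: Cell) -> Estimate:
    """Toy estimator: confidence grows with the number of known attributes."""
    i, target = cell
    row = rows[i]
    for attr, value in row.items():
        if value is not None and (attr, value, target) in FAKE_WEB:
            known = sum(v is not None for v in row.values())
            return known / len(row), FAKE_WEB[(attr, value, target)]
    return 0.0, None


if __name__ == "__main__":
    table = [{"name": "Alice", "city": None, "country": None}]
    filled = greedy_impute(table, toy_estimate, min_confidence=0.3)
    print(filled)
    print(table)
```

In this toy run, `country` cannot be retrieved until `city` has been filled, so the scheduler fills `city` first and the second round finds `country` with a higher confidence. Re-scoring every remaining cell on each round is the simplest choice but also the most expensive one, which is the kind of cost that confidence-estimation optimizations would aim to reduce.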
