WebPut: Efficient Web-Based Data Imputation

In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. To this end, WebPut exploits the information already available in an incomplete database together with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods to formulate web search queries that retrieve missing values with high accuracy. WebPut employs a confidence-based scheme that leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. We also propose a greedy iterative algorithm that schedules the order in which the different missing values in a database are imputed, and in turn the order in which their corresponding imputation queries are issued, improving both the accuracy and efficiency of WebPut. Experiments on several real-world data collections demonstrate that WebPut outperforms existing approaches.
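To make the greedy, confidence-based scheduling idea concrete, the following is a minimal sketch of such a loop, not WebPut's actual implementation. It assumes two hypothetical callbacks: `candidate_queries`, which formulates web search queries for a missing cell (possibly using values imputed so far), and `run_query`, which returns an extracted value with a confidence score. At each step, the highest-confidence imputation across all remaining cells is committed first, so later queries can build on earlier fills.

```python
# Minimal sketch (assumptions, not WebPut's code): greedily impute the
# missing value whose best available web query scores highest.
from typing import Callable, Dict, Hashable, List, Optional, Tuple

MissingCell = Hashable   # e.g. a (row_id, attribute) pair
Query = str              # a formulated web search query


def greedy_impute(
    missing_cells: List[MissingCell],
    candidate_queries: Callable[[MissingCell, Dict[MissingCell, str]], List[Query]],
    run_query: Callable[[Query], Tuple[Optional[str], float]],
) -> Dict[MissingCell, str]:
    """Greedily fill missing cells, most confident query first.

    candidate_queries(cell, filled) -> queries that may reuse already-imputed
        values of other cells, so later queries benefit from earlier fills.
    run_query(query) -> (extracted_value, confidence), confidence in [0, 1].
    """
    filled: Dict[MissingCell, str] = {}
    remaining = set(missing_cells)

    while remaining:
        # Score every remaining cell by its single best candidate query.
        best: Optional[Tuple[float, MissingCell, str]] = None
        for cell in remaining:
            for q in candidate_queries(cell, filled):
                value, conf = run_query(q)
                if value is not None and (best is None or conf > best[0]):
                    best = (conf, cell, value)

        if best is None:       # no query returned a usable value; stop
            break

        _, cell, value = best  # commit the highest-confidence imputation
        filled[cell] = value
        remaining.remove(cell)

    return filled
```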
