Crowdsourcing-Enhanced Missing Values Imputation Based on Bayesian Network

With the development of the Internet, datasets continue to grow larger and noisier. Data collection introduces many kinds of data quality problems, among which incompleteness is one of the most serious. Existing methods for missing value imputation rely mostly on statistics and machine learning. These methods are limited in both efficiency and accuracy because of high-dimensional computation and the low quality of the initial data. In this paper, we propose a new method that combines a Bayesian network with crowdsourcing to impute missing values. We use the Bayesian network to infer missing values, which improves efficiency, and use crowdsourcing to obtain the additional information needed to improve accuracy. Experiments on real datasets show that our method achieves better performance than other imputation methods.
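The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea under stated assumptions: a two-node Bayesian network (one edge, parent to child) whose conditional probability table is estimated by counting, with a confidence threshold that decides when to fall back to the crowd. The function names, the `threshold` parameter, the `ask_crowd` callback, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def learn_cpt(rows, parent, child):
    """Estimate P(child | parent) by counting rows where both values are observed."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[parent] is not None and row[child] is not None:
            counts[row[parent]][row[child]] += 1
    return {
        p: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for p, ctr in counts.items()
    }


def impute(rows, parent, child, ask_crowd, threshold=0.8):
    """Fill missing child values from the network; ask the crowd when unsure.

    `ask_crowd` is a hypothetical callback standing in for a crowdsourcing
    platform; `threshold` is an assumed confidence cutoff.
    """
    cpt = learn_cpt(rows, parent, child)
    filled, crowd_questions = [], 0
    for row in rows:
        row = dict(row)  # copy so the input rows are left unchanged
        if row[child] is None:
            posterior = cpt.get(row[parent], {})
            best, prob = max(
                posterior.items(), key=lambda kv: kv[1], default=(None, 0.0)
            )
            if prob >= threshold:
                row[child] = best  # confident inference, no crowd cost
            else:
                row[child] = ask_crowd(row, child)  # uncertain, ask a worker
                crowd_questions += 1
        filled.append(row)
    return filled, crowd_questions


# Toy data (invented for illustration): "dept" is the parent, "building" the child.
rows = [
    {"dept": "A", "building": "X"},
    {"dept": "A", "building": "X"},
    {"dept": "A", "building": None},
    {"dept": "B", "building": "X"},
    {"dept": "B", "building": "Y"},
    {"dept": "B", "building": None},
]

# A stand-in crowd that always answers "Y".
filled, asked = impute(rows, "dept", "building", ask_crowd=lambda row, attr: "Y")
```

In this toy run, the missing value for department A is inferred directly because the observed rows agree, while the one for department B is evenly split and is therefore sent to the crowd. The point of the design is that crowd questions are spent only where inference alone is unreliable.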
