Paper information - A Framework to Detect Disguised Missing Data

Many manually populated, very large databases suffer from data quality problems such as missing or inaccurate data and duplicate entries. A recently recognized data quality problem is disguised missing data, which arises when no explicit code for missing data, such as NA (Not Available), is provided and a legitimate data value is used instead. Such values can severely affect the outcome of data mining tasks: association mining algorithms may produce biased or inaccurate association rules, and clustering techniques may produce invalid clusters. Detecting and eliminating these values is necessary but burdensome to carry out manually. This chapter first explains methods for detecting disguised missing values by visual inspection. The authors then describe methods for detecting these values automatically. Finally, a framework to detect disguised missing data is proposed and demonstrated on spatial and categorical data sets.
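As an illustration of the problem the abstract describes, the sketch below flags values in a single column that occur far more often than the column's other values, which is one common symptom of a default value standing in for missing data. It is not the framework proposed in the chapter; the function name `suspicious_values`, the `ratio` threshold, and the sample birth dates are all hypothetical choices made for this example.

```python
from collections import Counter
from statistics import median


def suspicious_values(column, ratio=3.0):
    """Return {value: count} for values that are unusually frequent.

    A value is flagged when its count exceeds `ratio` times the median
    count over all distinct values in the column. This is an assumed
    heuristic for illustration, not the chapter's method.
    """
    counts = Counter(column)
    if not counts:
        return {}
    threshold = ratio * median(counts.values())
    return {value: n for value, n in counts.items() if n > threshold}


# Hypothetical column where "1900-01-01" was entered whenever the
# real birth date was unknown.
birth_dates = ["1900-01-01"] * 6 + [
    "1984-03-12", "1991-07-30", "1975-11-02", "1984-03-12",
]
print(suspicious_values(birth_dates))
```

A flagged value is only a candidate: a legitimately common value will be flagged by the same rule, so the result still needs review against domain knowledge.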

Tugba Taskaya Temizel | Rahime Belen

[1] Comparisons and Applications of Quantitative Signal Detections for Adverse Drug Reactions (ADRs): An Empirical Study Based On The Food And Drug Administration (FDA) Adverse Event Reporting System (AERS) And A Large Medical Claims Database, 2008.

[2] Fuhui Long, et al. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3] Ahad Zare Ravasan, et al. A Novel Hybrid Algorithm Based on K-Means and Evolutionary Computations for Real Time Clustering, 2014, Int. J. Data Warehous. Min.

[4] Theodore Johnson, et al. Exploratory Data Mining and Data Cleaning, 2003.

[5] Basar Oztaysi, et al. User Segmentation Based on Twitter Data Using Fuzzy Clustering, 2013.

[6] Jian Pei, et al. Cleaning disguised missing data: a heuristic approach, 2007, KDD '07.

[7] Pedro Larrañaga, et al. Gaussian-Stacking Multiclassifiers for Human Embryo Selection, 2009.

[8] Ronald K. Pearson, et al. The problem of disguised missing data, 2006, SKDD.

[9] Jerzy W. Grzymala-Busse, et al. A Comparison of Several Approaches to Missing Attribute Values in Data Mining, 2000, Rough Sets and Current Trends in Computing.

[10] Ulrich Güntzer, et al. Data Quality Mining - Making a Virtue of Necessity, 2001, DMKD.