Data Quality Assessment via Robust Clustering

Although data mining is widely used in business and industry to improve the quality of decision making, data quality has long been neglected in practice. As a result, the analytical results derived by data mining methods are often questionable and too unreliable to represent useful knowledge or to support decision making. This paper proposes a generic framework for data quality assessment in nonhomogeneous environments based on robust clustering analysis. In particular, trimmed clustering methods are proposed to robustly characterize groups of similar observations; the trimmed observations are then evaluated to assess their outlyingness based on their distance from the cluster profiles. Simulation studies demonstrate the effectiveness of the proposed framework.
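To make the idea concrete, the following is a minimal sketch of one trimmed clustering method, trimmed k-means, followed by a distance-based outlyingness score. It is an illustration under stated assumptions, not the framework evaluated in this paper: the function name trimmed_kmeans, the trimming fraction alpha, the use of Euclidean distance, and the use of cluster means as "cluster profiles" are all choices made here for the example.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=50, seed=0):
    """Illustrative trimmed k-means (hypothetical helper, not the paper's code).

    At each iteration the floor(alpha * n) observations farthest from their
    nearest center are set aside, and centers are updated from the rest.
    Returns centers, nearest-center labels, a boolean mask of trimmed
    observations, and each observation's distance to its nearest center.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = n - int(np.floor(alpha * n))

    # Assumption: initialize centers at k distinct randomly chosen observations.
    centers = X[rng.choice(n, size=k, replace=False)]

    for _ in range(n_iter):
        # Squared Euclidean distance from every observation to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        nearest = d2.min(axis=1)

        # Keep the n_keep closest observations; the rest are trimmed.
        keep = np.argsort(nearest)[:n_keep]

        new_centers = centers.copy()
        for j in range(k):
            members = keep[labels[keep] == j]
            if members.size > 0:  # an empty cluster keeps its old center
                new_centers[j] = X[members].mean(axis=0)

        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Final assignment and outlyingness score against the fitted centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    score = np.sqrt(d2.min(axis=1))
    trimmed = np.ones(n, dtype=bool)
    trimmed[np.argsort(score)[:n_keep]] = False
    return centers, labels, trimmed, score
```

In this sketch, the observations flagged in trimmed are the candidates for data quality review, and score ranks them by how far they lie from the nearest cluster center. Like ordinary k-means, the result depends on the random initialization, so in practice one would typically run several restarts and keep the best solution.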