An Efficient Algorithm for Data Cleaning

The quality of real-world data fed into a data warehouse is a major concern today. Because the data comes from a variety of sources, it must be checked for errors and anomalies before it is loaded into the warehouse. The source data may contain exact or approximate duplicate records. Incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. This paper addresses the detection and correction of such duplicate records. It also analyzes data quality and the factors that degrade it. Existing work is briefly reviewed and its major limitations are pointed out. A new framework is then proposed that improves on the existing technique.
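
To make the distinction between exact and approximate duplicates concrete, the following is a minimal Python sketch. It is not the framework proposed in the paper, whose details the abstract does not give. The function names (`normalize`, `similarity`, `find_duplicates`), the 0.85 threshold, and the sample records are all assumptions chosen for illustration, and the string comparison uses the standard library's `difflib.SequenceMatcher` rather than any technique named in the source.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase each field and collapse runs of whitespace."""
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    """Mean per-field similarity ratio in [0, 1] for two normalized records."""
    ratios = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def find_duplicates(records, threshold=0.85):
    """Return (exact, approximate) lists of index pairs (i, j) with i < j.

    The threshold is a hypothetical value; a real system would tune it.
    """
    normalized = [normalize(r) for r in records]
    exact, approximate = [], []
    # Naive all-pairs comparison: quadratic, fine only for small inputs.
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            if normalized[i] == normalized[j]:
                exact.append((i, j))
            elif similarity(normalized[i], normalized[j]) >= threshold:
                approximate.append((i, j))
    return exact, approximate


# Hypothetical sample records: (name, address)
records = [
    ("John Smith", "12 Main Street"),
    ("john  smith", "12 Main Street"),
    ("Jon Smith", "12 Main St."),
    ("Alice Brown", "99 Oak Avenue"),
]
exact, approximate = find_duplicates(records)
print("exact:", exact)
print("approximate:", approximate)
```

Here the first two records become identical after normalization and are reported as exact duplicates, while the third differs by a spelling variant and an abbreviation and is caught only by the similarity threshold. The all-pairs loop is the simplest possible design; it does not scale to warehouse-sized inputs, where some form of blocking or clustering would be needed to limit comparisons.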
