A Novel Framework and Model for Data Warehouse Cleansing

ABSTRACT

Data cleansing is the process of identifying corrupt and duplicate data in the data sets of a data warehouse in order to enhance data quality. This paper aims to facilitate the data cleansing process by addressing the problem of duplicate record detection in the 'name' attribute of data sets. It provides a sequence of algorithms, organized in a novel framework, for identifying duplicates in the 'name' attribute of the data sets of an existing data warehouse. The key features of this research include the proposal of a novel framework built on a well-defined sequence of algorithms, and a refinement of the application of alliance rules [1] that incorporates previously existing, well-defined similarity computation measures. The results presented show the feasibility and validity of the suggested method.

Keywords

Data warehouse, data cleansing, fuzzy logic, data mining.

1. INTRODUCTION

A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management's decisions [2]. It is the central point of data integration for business intelligence. An enterprise generally runs numerous operational systems which, together with external agencies, are the major sources of data for its data warehouse. These operational systems store data for individual applications, and these applications work on more or less similar entities. For example, in a telecommunication company, applications such as customer billing, customer credit verification and customer personal information management would have many attributes in common, such as customer name and address, but these common attributes may or may not be represented in exactly the same format in their associated databases. In such a scenario, when data is integrated for storage in a data warehouse, unintentional duplication among records created from millions of data entries from heterogeneous sources can hardly be avoided. This leads to redundancy in the data and degradation of its quality.

A data warehouse suffers from dirty data. Dirty data can be understood as data that is not consistent with the data already residing in the data warehouse [3, 4]; missing data, for example, can be considered dirty, as can data that is unknown or unclear at the time of data entry. Data warehouses, after their construction by merging numerous operational systems, are also updated periodically to accommodate new data, and the probability of introducing dirty data increases with each update. Even the most carefully and cautiously planned update cannot give foolproof protection from dirty data [5]. The presence of dirty data leads to serious problems because the data warehouse is used for decision making and strategic planning in an organization. Data cleansing is the most obvious, and also the only viable, solution to these problems: it is the activity of detecting redundant or unwanted data and correcting it to increase data quality. The need, therefore, is for an automated data cleansing tool [6] capable of maintaining as well as improving data quality in the data warehouse. Also required are a general framework for the data cleansing process, as in [7], and methods that can address the problem, such as statistical outlier detection, pattern matching, clustering and data mining techniques.
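Since the framework builds on previously existing similarity computation measures for comparing 'name' values, a minimal sketch of one such measure may help fix ideas. The Python snippet below is illustrative only; the function names and the 0.8 threshold are assumptions made for this sketch, not values taken from the paper. It computes a normalized Levenshtein (edit-distance) similarity between two names and flags the pair as a likely duplicate when the score crosses the threshold.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the names are identical."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

THRESHOLD = 0.8  # illustrative cut-off, not prescribed by the paper

pair = ("Rajiv Kumar", "Rajeev Kumar")
score = name_similarity(*pair)           # ~0.83 for this pair
print(pair, score, score >= THRESHOLD)   # flagged as a likely duplicate

A score like this is only one ingredient; the framework proposed here combines such similarity measures with alliance rules [1] when deciding which records are true duplicates.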
The 'name' field, as noted in [8], is one of the most common attributes used in query processing. Finding and matching personal names is at the core of an increasingly large number of applications, from text and web mining, information retrieval and search engines to de-duplication and data linkage systems. A large amount of the data in most warehouses pertains to people, so personal names are frequently used as search criteria; the significance and importance of the personal name field can therefore hardly be overstated. Problems in the name field arise for several reasons, ranging from spelling variations of personal names and the use of nicknames to changes of name over time. Reference [8] also presents an overview of a comprehensive number of name matching techniques along with their experimental results. In some cases it is not easy to determine whether a name variation is a different spelling of the same name, as in [9], or a different name altogether. The types of variation include:

• Spelling variations: These generally include interchanged or misplaced characters due to typographical errors, substituted letters (as in Rajiv and Rajeev) and supplementary letters (as in Smythe); a phonetic-coding sketch illustrating this case follows below. Such variations generally do not alter the phonetic configuration of the name, yet they still cause mismatches between the corresponding names. These variations mainly
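As noted above, spelling variants such as Rajiv/Rajeev or Smith/Smythe typically leave the phonetic structure of a name intact, so a phonetic code can group them even when character-level comparison would keep them apart. The sketch below implements the classic Soundex code, a standard phonetic algorithm offered here purely as an illustration rather than as the measure this paper adopts; both example pairs receive identical codes.

# Consonant-to-digit table for the classic Soundex algorithm.
_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a personal name."""
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    digits = []
    prev = _SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":        # 'h' and 'w' do not reset the previous code
            prev = code
    return (name[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Rajiv"), soundex("Rajeev"))   # both R210
print(soundex("Smith"), soundex("Smythe"))   # both S530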

REFERENCES

[1] Heiko Mueller et al. Problems, Methods, and Challenges in Comprehensive Data Cleansing, 2005.

[2] Erhard Rahm et al. Data Cleaning: Problems and Current Approaches, IEEE Data Eng. Bull., 2000.

[3] Ahmed K. Elmagarmid et al. Duplicate Record Detection: A Survey, IEEE Transactions on Knowledge and Data Engineering, 2007.

[4] Arthur Chapman. Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen, 2005.

[5] Andrian Marcus et al. Ordinal Association Rules for Error Identification in Data Sets, CIKM '01, 2001.

[6] Rajeev Motwani et al. Robust and Efficient Fuzzy Match for Online Data Cleaning, SIGMOD '03, 2003.

[7] Gérard Bouchard et al. Name Variations and Computerized Record Linkage, 1980.

[8] Andrian Marcus et al. Data Cleansing: Beyond Integrity Analysis, 2000.

[9] Andrian Marcus et al. Automated Identification of Errors in Data Sets, 2000.

[10] Thomas Redman et al. The Impact of Poor Data Quality on the Typical Enterprise, CACM, 1998.

[11] Jakub Piskorski. Usability of String Distance Metrics for Name Matching Tasks in Polish, 2007.

[12] Paulraj Ponniah et al. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, 2001.

[13] Surajit Chaudhuri et al. Eliminating Fuzzy Duplicates in Data Warehouses, VLDB, 2002.

[14] Andrian Marcus et al. Utilizing Association Rules for the Identification of Errors in Data, 2000.

[15] Payal Pahwa et al. Alliance Rules for Data Warehouse Cleansing, 2009 International Conference on Signal Processing Systems, 2009.

[16] Peter Christen et al. A Comparison of Personal Name Matching: Techniques and Practical Issues, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), 2006.

[17] Andrian Marcus et al. Data Cleansing: Beyond Integrity Analysis, IQ, 2000.

[18] Amit Rudra et al. Key Issues in Achieving Data Quality and Consistency in Data Warehousing among Large Organisations in Australia, Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32), 1999.

[19] Marcin Sydow et al. String Distance Metrics for Reference Matching and Search Query Correction, BIS, 2007.