E-Clean: A Data Cleaning Framework for Patient Data

Quality data must be prepared by pre-processing the raw data. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. Data cleaning systems are needed to support any changes in the structure, representation or content of data. The cleaning process has three parts: extracting invalid values, matching attributes with valid values, and applying a data cleaning algorithm. Our system uses the extract, transform and load (ETL) model as its main process model, which serves as a guideline for the implementation. In addition, parsing techniques are used to identify dirty data. The method we chose for matching attributes is regular expressions. Among the available data cleaning algorithms, the k-Nearest Neighbor algorithm was selected for the data cleaning part of this project because it is simple to understand and easy to implement.
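
The three parts of the cleaning process described above can be illustrated with a minimal sketch. It is not the E-Clean implementation: the record layout, the field names (age, height_cm, weight_kg), the age pattern and the choice of the neighbors' mean as the replacement value are all assumptions made for illustration.

```python
import math
import re

# Hypothetical validity rule (not from the paper): an age is 1 to 3 digits.
AGE_PATTERN = re.compile(r"\d{1,3}")


def is_valid_age(value):
    """Return True when the raw string fully matches the age pattern."""
    return AGE_PATTERN.fullmatch(value.strip()) is not None


def knn_impute_age(records, k=3):
    """Replace invalid ages with the mean age of the k nearest valid records.

    Nearness is Euclidean distance over the assumed numeric features
    (height_cm, weight_kg). Features are used unscaled to keep the sketch short.
    """
    valid = [r for r in records if is_valid_age(r["age"])]
    cleaned = []
    for record in records:
        if is_valid_age(record["age"]):
            cleaned.append({**record, "age": int(record["age"])})
            continue
        point = (record["height_cm"], record["weight_kg"])
        neighbours = sorted(
            valid,
            key=lambda v: math.dist(point, (v["height_cm"], v["weight_kg"])),
        )[:k]
        if neighbours:
            estimate = round(sum(int(n["age"]) for n in neighbours) / len(neighbours))
        else:
            estimate = None  # no valid record to learn from
        cleaned.append({**record, "age": estimate})
    return cleaned


# Made-up sample records; record 4 carries a dirty age value.
patients = [
    {"id": 1, "age": "34", "height_cm": 170.0, "weight_kg": 68.0},
    {"id": 2, "age": "36", "height_cm": 172.0, "weight_kg": 70.0},
    {"id": 3, "age": "8", "height_cm": 128.0, "weight_kg": 26.0},
    {"id": 4, "age": "3x", "height_cm": 171.0, "weight_kg": 69.0},
    {"id": 5, "age": "71", "height_cm": 165.0, "weight_kg": 80.0},
]

cleaned = knn_impute_age(patients, k=2)
print(cleaned[3]["age"])  # 35, the mean age of records 1 and 2
```

The regular expression both flags the invalid value and defines what a valid one looks like, while the k-Nearest Neighbor step supplies the replacement. In practice the features would likely need scaling so that no single attribute dominates the distance.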
