Journal of Engineering and Technology for Industrial Applications

Article History: Received: August 12, 2020; Accepted: October 19, 2020; Published: October 30, 2020

One of the great challenges in obtaining knowledge from data sources is ensuring the consistency and non-duplication of stored information. Many techniques have been proposed to reduce the cost of this work and to allow data to be analyzed and properly corrected. However, other aspects essential to the success of the data cleaning process, spanning several technological areas, remain open: performance, semantics, and autonomy of the process. Against this backdrop, we developed an automated, configurable data cleaning environment based on training and physical-semantic data similarity. It aims to provide a more efficient and extensible tool for information correction, covering problems not yet explored, such as the semantics and autonomy of the cleaning process. Among its objectives is reducing user interaction in the analysis and correction of data inconsistencies and duplications. With a properly calibrated environment, the tool covers approximately 90% of the inconsistencies in the database with no false positives. We also present approaches that, besides detecting and treating true cases of inconsistency and duplication, address detected false positives and the negative impact they may have on the data cleaning process, whether manual or automated, a topic not yet widely discussed in the literature. The most significant contribution of this work is the developed tool, which, without user interaction, automatically analyzes and eliminates 90% of the inconsistencies and duplications contained in a database, with no occurrence of false positives. Test results confirmed the effectiveness of all the developed features in each module of the proposed architecture.
In several scenarios the experiments demonstrated the effectiveness of the tool.
