Predictive Data Transformation Suggestions in Grafterizer Using Machine Learning

Data preprocessing is a crucial step in data analysis. A substantial amount of time is spent on data transformation tasks such as formatting, modification, extraction, and enrichment, so users benefit from systems that can recommend the most relevant transformations for a given dataset. In this paper, we propose an approach for generating relevant data transformation suggestions for tabular data preprocessing using machine learning (specifically, the Random Forest algorithm). The approach is implemented for Grafterizer, a Web-based framework for tabular data cleaning and transformation, and evaluated through a usability study.
