How to Address the Data Quality Issues in Regression Models: A Guided Process for Data Cleaning

Today, data availability has gone from scarce to superabundant. Technologies such as IoT, social media trends, and smartphone capabilities are producing and digitizing large amounts of data that were previously unavailable. This massive increase in data creates opportunities for new business models, but it also demands new data quality techniques and methods for knowledge discovery, especially when the data comes from different sources (e.g., sensors, social networks, and cameras). The conclusions drawn from a dataset depend on the quality of the data it contains, and data quality is increasingly addressed with the aid of data cleaning approaches. Guaranteeing high data quality is therefore considered a primary goal of the data scientist. In this paper, we propose a process for data cleaning in regression models (DC-RM). The proposed process is evaluated on real datasets from the UCI Repository of Machine Learning Databases. To assess the process, the datasets cleaned by DC-RM were used to train the same regression models proposed by the authors of the UCI datasets. The results achieved by the models trained on the datasets produced by DC-RM are better than or equal to those reported by the datasets' authors.
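The evaluation idea described above (train the same regression model on a cleaned and an uncleaned version of the data, then compare errors on held-out data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic rather than taken from UCI, the model is a one-variable least-squares line, and the cleaning steps (duplicate removal, dropping rows with missing targets, an IQR outlier filter) are generic stand-ins, since the abstract does not list the actual steps of DC-RM.

```python
import random
import statistics


def fit_line(rows):
    """Closed-form ordinary least squares for y = a + b*x."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    b = sxy / sxx
    return my - b * mx, b


def mae(model, rows):
    """Mean absolute error of the fitted line on (x, y) rows."""
    a, b = model
    return statistics.fmean([abs(y - (a + b * x)) for x, y in rows])


def clean(rows):
    """Hypothetical cleaning steps; NOT the DC-RM steps themselves."""
    rows = list(dict.fromkeys(rows))                   # drop exact duplicates
    rows = [(x, y) for x, y in rows if y is not None]  # drop missing targets
    q1, _, q3 = statistics.quantiles([y for _, y in rows], n=4)
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [(x, y) for x, y in rows if lo <= y <= hi]  # IQR outlier filter


def make_rows(n):
    """Synthetic data: y = 1 + 2x plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 1.0 + 2.0 * x + random.gauss(0, 0.5)) for x in xs]


random.seed(0)
train, test = make_rows(200), make_rows(100)

# Inject data quality issues into the training data only.
dirty = list(train)
for i in range(0, 200, 20):                # 10 outliers in the target
    x, y = dirty[i]
    dirty[i] = (x, y + 50.0)
dirty += dirty[:5]                         # 5 duplicated rows
dirty += [(3.0, None), (7.0, None)]        # 2 rows with a missing target

# Baseline: only the minimum needed to fit at all (drop missing targets).
raw = [(x, y) for x, y in dirty if y is not None]
cleaned = clean(dirty)

mae_raw = mae(fit_line(raw), test)
mae_clean = mae(fit_line(cleaned), test)
print(f"rows: raw={len(raw)}, cleaned={len(cleaned)}")
print(f"test MAE: raw={mae_raw:.3f}, cleaned={mae_clean:.3f}")
```

Because both models are evaluated on the same untouched test rows, any difference in error can be attributed to the training data rather than to the model. The outlier filter here looks only at the target's marginal distribution, which is a crude choice that happens to suit this synthetic example.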
