Overview and Importance of Data Quality for Machine Learning Tasks

It is well understood in the literature that the performance of a machine learning (ML) model is upper-bounded by the quality of its data. While researchers and practitioners have focused on improving the quality of models (e.g., through neural architecture search and automated feature selection), efforts toward improving data quality remain limited. A crucial requirement before consuming a dataset for any application is to understand the dataset at hand; failure to do so can result in inaccurate analytics and unreliable decisions. Assessing data quality with intelligently designed metrics, and developing corresponding transformation operations to address the quality gaps, reduces the effort a data scientist spends iteratively debugging the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications. It surveys the important data-quality approaches discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we discuss related work in this space at IBM Research.
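As a minimal sketch of what such a metric-based assessment might look like, the snippet below computes two simple quality indicators (class-imbalance ratio and exact-duplicate rate) for a labelled text dataset. The function name and metric choices are illustrative assumptions, not part of any specific system described in the tutorial.

```python
from collections import Counter

def data_quality_report(records, labels):
    """Compute a few simple data-quality metrics for a labelled dataset."""
    n = len(records)
    counts = Counter(labels)
    # Class imbalance: ratio of the largest class size to the smallest.
    imbalance = max(counts.values()) / min(counts.values())
    # Exact-duplicate rate: fraction of records that repeat an earlier one.
    duplicates = n - len(set(records))
    return {
        "num_samples": n,
        "class_imbalance_ratio": imbalance,
        "duplicate_rate": duplicates / n,
    }

report = data_quality_report(
    ["good movie", "bad film", "good movie", "great plot"],
    ["pos", "neg", "pos", "pos"],
)
```

A real assessment pipeline would add richer metrics (label noise, outliers, feature overlap) and pair each metric with a remediation operation, but the shape is the same: score the dataset first, then transform it.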
