A Cautionary Tale for Machine Learning Design: Why We Still Need Human-Assisted Big Data Analysis

Supervised machine learning (ML) requires that algorithms examine a very large number of labeled samples before they can make accurate predictions. Yet even a very large dataset does not guarantee success. In our experience, a neural network trained on a database of over fifteen million water meter readings essentially failed to predict, from a history of water consumption measurements, when a meter would malfunction or need disassembly. In a second step, we developed a methodology, based on enforcing a specialized data semantics, that allowed us to extract for training only those samples that were not corrupted by data impurities. With this methodology, we re-trained the neural network to a prediction accuracy of over 80%. At the same time, we realized that the new training dataset was significantly different from the initial one in statistical terms, and much smaller as well. We had reached a sort of paradox: we had alleviated the initial problem with a more interpretable model, but we had altered the form of the initial data. To reconcile that paradox, we further enhanced our data semantics with the contribution of field experts. This finally led to the extraction of a training dataset that is truly representative of regular and defective water meters and able to describe the underlying statistical phenomenon, while the resulting classifier still provides excellent prediction accuracy. The lesson we have learnt at the end of this path is that a human-in-the-loop approach can significantly help to clean and re-organize noisy datasets, leading to a more effective ML design.
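The cleaning step described above can be illustrated with a minimal Python sketch. The abstract does not state the actual semantic rules, so everything here is an assumption made for illustration: the `Reading` fields, the rule set in `is_semantically_valid`, the `max_daily_m3` threshold, and the sample values are all hypothetical. The sketch shows only the general idea of filtering samples by explicit rules and then comparing simple statistics of the initial and cleaned datasets, which is where a shift like the one reported would become visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One labeled sample (hypothetical schema, not the authors')."""
    meter_id: str
    consumption_m3: float  # water consumed since the previous reading
    days_elapsed: int      # days since the previous reading
    defective: bool        # label: meter was found to be faulty


def is_semantically_valid(r: Reading, max_daily_m3: float = 50.0) -> bool:
    """Hypothetical semantic rules; real ones would come from field experts."""
    if r.days_elapsed <= 0:      # a reading must follow an earlier one
        return False
    if r.consumption_m3 < 0:     # consumption cannot be negative
        return False
    # Reject implausibly high daily consumption (assumed threshold).
    return r.consumption_m3 / r.days_elapsed <= max_daily_m3


def clean(readings: list[Reading]) -> list[Reading]:
    """Keep only the samples that satisfy every semantic rule."""
    return [r for r in readings if is_semantically_valid(r)]


def summary(readings: list[Reading]) -> dict[str, float]:
    """Basic statistics used to compare a dataset before and after cleaning."""
    n = len(readings)
    if n == 0:
        return {"n": 0, "defective_rate": 0.0, "mean_consumption_m3": 0.0}
    defective = sum(1 for r in readings if r.defective)
    total = sum(r.consumption_m3 for r in readings)
    return {
        "n": n,
        "defective_rate": defective / n,
        "mean_consumption_m3": total / n,
    }


# Made-up sample data, for illustration only.
SAMPLE = [
    Reading("A", 12.0, 30, False),
    Reading("B", -3.0, 30, False),   # negative consumption
    Reading("C", 9.0, 0, True),      # no elapsed time
    Reading("D", 4000.0, 30, True),  # implausible daily consumption
    Reading("E", 1.5, 30, True),
]

CLEANED = clean(SAMPLE)
BEFORE = summary(SAMPLE)
AFTER = summary(CLEANED)
```

Comparing `BEFORE` and `AFTER` makes the trade-off concrete: the cleaned set is smaller and its label balance and mean consumption differ from the initial set, so any gain in accuracy has to be weighed against how representative the remaining samples are.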
