Is bigger always better? A controversial journey to the center of machine learning design, with uses and misuses of big data for predicting water meter failures

In this paper, we describe the design of a machine learning-based classifier tailored to predict whether a water meter will fail or need replacement. Our initial attempt to train a recurrent deep neural network (RNN) on 15 million readings gathered from 1 million mechanical water meters spread throughout Northern Italy led to unsatisfactory results. We learned that this was due to insufficient attention to the quality of the analyzed data. We hence developed a novel methodology, based on a new semantics that we enforced on the training data. This allowed us to extract only those samples that are representative of the complex phenomenon of defective water meters. With this methodology, the accuracy of our RNN exceeded the 80% threshold. At the same time, we realized that the new training dataset differed significantly, in statistical terms, from the initial dataset, leading to an apparent paradox. With our contribution, we demonstrate how to reconcile this paradox, showing that our classifier can help detect defective meters while simplifying replacement procedures.
