The Internet of Things: Opportunities and Challenges for Distributed Data Analysis

Data is now not only created by humans but also collected automatically by physical things that embed electronics, software, sensors and network connectivity. Together, these entities constitute the Internet of Things (IoT). The automated analysis of its data can provide insights into previously unknown relationships between things, their environment and their users, facilitating an optimization of their behavior. In particular, real-time analysis of data, embedded into physical systems, can enable new forms of autonomous control. These in turn may lead to more sustainable applications, reducing waste and saving resources. The IoT's distributed and dynamic nature, the resource constraints of sensors and embedded devices, and the amounts of generated data challenge even the most advanced automated data analysis methods known today. In particular, the IoT requires a new generation of distributed analysis methods. Many existing surveys have focused strongly on the centralization of data in the cloud and on big data analysis, which follows the paradigm of parallel high-performance computing. However, bandwidth and energy can be too limited for transmitting raw data, or transmission may be prohibited by privacy constraints. Such communication-constrained scenarios require decentralized analysis algorithms which work at least partly on the generating devices themselves. After listing data-driven IoT applications, we highlight, in contrast to existing surveys, the differences between cloud-based and decentralized analysis from an algorithmic perspective. We present the opportunities and challenges of research on communication-efficient decentralized analysis algorithms. Here, the focus is on the difficult scenario of vertically partitioned data, which covers common IoT use cases. The comprehensive bibliography aims to provide readers with a good starting point for their own work.
