Approximate Edge Analytics for the IoT Ecosystem

IoT-enabled devices continue to generate massive amounts of data. Transforming this continuously arriving raw data into timely insights is critical for many modern online services. In such settings, the traditional approach of running analytics over the entire dataset is prohibitively expensive and too slow to support real-time stream analytics. In this work, we make a case for approximate computing for data analytics in IoT settings. Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing, based on the chosen sample size, can make a systematic trade-off between output accuracy and computation efficiency. This motivated the design of APPROXIOT, a data analytics system for approximate computing in IoT. To realize this idea, we designed an online hierarchical stratified reservoir sampling algorithm that uses edge computing resources to produce approximate output with rigorous error bounds. To showcase the effectiveness of our algorithm, we implemented APPROXIOT based on Apache Kafka and evaluated it using a set of microbenchmarks and real-world case studies. Our results show that, compared to simple random sampling, APPROXIOT achieves a speedup of 1.3X-9.9X as the sampling fraction varies from 80% to 10%.
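The core idea of stratified reservoir sampling can be illustrated with a minimal sketch. It is not the APPROXIOT implementation: the class name `StratifiedReservoir`, its methods, and the fixed per-stratum capacity are assumptions made for illustration, and the sketch covers a single node only, without the hierarchical (edge-to-datacenter) stages or the error-bound estimation described above. Each stratum (for example, one sensor type) keeps its own fixed-size reservoir, and sampled items are scaled by a per-stratum weight when estimating a sum.

```python
import random
from collections import defaultdict


class StratifiedReservoir:
    """Keeps a fixed-size uniform sample per stratum (hypothetical helper)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity            # max sampled items per stratum
        self.rng = random.Random(seed)
        self.samples = defaultdict(list)    # stratum -> sampled values
        self.seen = defaultdict(int)        # stratum -> items observed so far

    def add(self, stratum, value):
        """Offer one stream item to the reservoir of its stratum."""
        self.seen[stratum] += 1
        reservoir = self.samples[stratum]
        if len(reservoir) < self.capacity:
            reservoir.append(value)
        else:
            # Reservoir step: the new item replaces a random slot with
            # probability capacity / seen[stratum].
            j = self.rng.randrange(self.seen[stratum])
            if j < self.capacity:
                reservoir[j] = value

    def estimate_sum(self):
        """Estimate the sum over all items by re-weighting each stratum."""
        total = 0.0
        for stratum, reservoir in self.samples.items():
            # Each sampled item stands in for seen / sample_size originals.
            weight = self.seen[stratum] / len(reservoir)
            total += weight * sum(reservoir)
        return total


if __name__ == "__main__":
    sampler = StratifiedReservoir(capacity=5, seed=0)
    for _ in range(100):
        sampler.add("temperature", 1.0)
    for _ in range(3):
        sampler.add("humidity", 10.0)
    print(sampler.estimate_sum())
```

Sampling per stratum ensures that a low-rate source (here `humidity`) is not crowded out by a high-rate one, which is the usual motivation for stratifying before sampling.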
