Query Rewriting for Incremental Continuous Query Evaluation in HIFUN

HIFUN is a high-level query language for expressing analytic queries over big datasets, offering a clear separation between the conceptual layer, where analytic queries are defined independently of the nature and location of data, and the physical layer, where queries are evaluated. In this paper, we present a methodology based on the HIFUN language, together with the corresponding algorithms, for the incremental evaluation of continuous queries. In essence, our approach processes only the most recent data batch and combines it with already computed information, without requiring the evaluation of the query over the complete dataset. We present a generic algorithm, which implements various query rewriting methods, and its translation to both SQL and MapReduce using SPARK. We demonstrate the effectiveness of our approach in terms of query answering efficiency. Finally, we show that by exploiting the formal query rewriting methods of HIFUN, we can further reduce the computational cost, adding another layer of query optimization to our implementation.
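The idea of reusing an already computed answer can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes a query that groups records by one attribute and sums a measure (a distributive aggregate), and the names `evaluate`, `evaluate_incremental`, `branch`, and `qty` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def evaluate(records, group_key, measure):
    """Full evaluation: group the records and sum the measure per group."""
    totals = defaultdict(int)
    for record in records:
        totals[group_key(record)] += measure(record)
    return dict(totals)

def evaluate_incremental(previous, batch, group_key, measure):
    """Incremental evaluation: evaluate the query on the new batch only,
    then merge that partial answer into the stored answer.
    Assumption: the aggregate is a sum, so partial answers can be added."""
    delta = evaluate(batch, group_key, measure)
    merged = dict(previous)
    for key, value in delta.items():
        merged[key] = merged.get(key, 0) + value
    return merged

# Hypothetical data: deliveries with a branch and a quantity.
old_data = [
    {"branch": "A", "qty": 10},
    {"branch": "B", "qty": 5},
    {"branch": "A", "qty": 7},
]
new_batch = [
    {"branch": "B", "qty": 3},
    {"branch": "C", "qty": 4},
]

by_branch = lambda r: r["branch"]
quantity = lambda r: r["qty"]

stored = evaluate(old_data, by_branch, quantity)
updated = evaluate_incremental(stored, new_batch, by_branch, quantity)
recomputed = evaluate(old_data + new_batch, by_branch, quantity)
```

Under these assumptions, `updated` equals `recomputed`, although only `new_batch` was scanned in the second step. Aggregates that are not distributive (an average, for example) would need extra stored state, such as a sum and a count per group.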
