Jumping the ORDER BY Barrier in Large-Scale Pattern Matching

Event-series pattern matching is a major component of large-scale data analytics pipelines, enabling a wide range of system diagnostics tasks. A precursor to pattern matching is an expensive "shuffle the world" stage in which data are ordered by time and shuffled across the network. Because many existing systems treat the pattern matching engine as a black box, they are unable to optimize the entire data analytics pipeline, and in particular this costly shuffle. This paper demonstrates how to optimize such queries. We first translate an expressive class of regular-expression-like patterns to relational queries so that they can benefit from decades of progress in relational optimizers. We then introduce the technique of abstract pattern matching, a linear-time preprocessing step that, adapting ideas from symbolic execution and abstract interpretation, discards input events that are guaranteed not to appear in successful matches. Abstract pattern matching first computes a conservative representation of the output-relevant domain of every transition in a pattern, based on the (unary) predicates of that transition. It then refines these domains based on the structure of the pattern (i.e., paths through the pattern) and on any of the pattern's join predicates across transitions. The outcome is an abstract filter that, when applied to the original stream, excludes events that are guaranteed not to participate in a match. We implemented abstract pattern matching in COSMOS/Scope and applied it to an industrial benchmark, where we obtained up to three orders of magnitude reduction in shuffled data and a 1.23x average speedup in total processing time.
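The two phases described above (per-transition domains from unary predicates, then refinement through join predicates) can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the abstract domain is a closed numeric interval over a single `value` attribute, the pattern is a plain sequence in which every transition must fire, join predicates have the form "value at transition i < value at transition j", and the names `Interval`, `refine`, and `abstract_filter` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi]; empty when lo > hi (assumed abstract domain)."""
    lo: float
    hi: float

    def is_empty(self):
        return self.lo > self.hi

    def contains(self, x):
        return self.lo <= x <= self.hi


def refine(domains, less_than):
    """Tighten domains using join predicates value[i] < value[j].

    The strict '<' is over-approximated with closed bounds, so the result
    stays conservative: it never excludes a value that could match.
    """
    domains = list(domains)
    changed = True
    while changed:  # iterate to a fixpoint; bounds only ever shrink
        changed = False
        for i, j in less_than:
            a, b = domains[i], domains[j]
            new_a = Interval(a.lo, min(a.hi, b.hi))  # a must lie below b's upper bound
            new_b = Interval(max(b.lo, a.lo), b.hi)  # b must lie above a's lower bound
            if new_a != a or new_b != b:
                domains[i], domains[j] = new_a, new_b
                changed = True
    return domains


def abstract_filter(events, domains):
    """Keep only events whose value lies in at least one transition's domain.

    Assumes a sequential pattern where every transition must fire, so one
    empty domain means no match is possible and every event can be dropped.
    """
    if any(d.is_empty() for d in domains):
        return []
    return [(ts, v) for ts, v in events if any(d.contains(v) for d in domains)]


# Hypothetical pattern A -> B -> C with unary predicates given as intervals
# and join predicates A.value < B.value and B.value < C.value.
unary = [Interval(0, 100), Interval(50, 80), Interval(0, 60)]
joins = [(0, 1), (1, 2)]

refined = refine(unary, joins)
events = [(1, 10), (2, 70), (3, 55), (4, 90), (5, 58), (6, 5)]  # (timestamp, value)
kept = abstract_filter(events, refined)
print(refined)
print(kept)
```

In this toy setting the filter is a single pass over the events and needs no ordering by time, which is why it could run before the shuffle. Events it keeps may still fail to match; it only guarantees that dropped events could not have matched.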
