Declarative error management for robust data-intensive applications

We present an approach to declaratively manage run-time errors in data-intensive applications. When large volumes of raw data meet complex third-party libraries, deterministic run-time errors become likely, and existing query processors typically stop without returning a result when one occurs. The ability to degrade gracefully in the presence of run-time errors and to execute jobs partially is typically limited to specific operators such as bulk loading. We generalize this concept to all operators of a query processing system by introducing a novel data type, "partial result with errors", and corresponding operators. We show how to extend existing error-unaware operators to support this type and, as an added benefit, eliminate side-effect-based error reporting. We use declarative specifications of acceptable results to control the semantics of error-aware operators. We have incorporated our approach into a declarative query processing system, which compiles the language constructs into instrumented execution plans for clusters of machines. We experimentally validate that the instrumentation overhead is below 20% in microbenchmarks and is not detectable when running I/O-intensive workloads.
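The idea of a "partial result with errors" can be illustrated with a small sketch. The code below is not the system described here. It is a minimal Python illustration under stated assumptions, and the names `Partial`, `lift_map`, and `accept` are hypothetical. It shows how an error-unaware per-record function might be wrapped so that failures become part of the result rather than aborting the job, and how a simple declarative threshold (here, a maximum error fraction) could decide whether the partial result is acceptable.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


@dataclass
class Partial(Generic[A]):
    """Hypothetical 'partial result with errors': computed values plus recorded failures."""
    values: List[A] = field(default_factory=list)
    # Each error is a pair (offending input record, error message).
    errors: List[Tuple[object, str]] = field(default_factory=list)


def lift_map(f: Callable[[A], B]) -> Callable[[Partial[A]], Partial[B]]:
    """Wrap an error-unaware per-record function as an error-aware map operator."""
    def op(inp: Partial[A]) -> Partial[B]:
        # Errors from upstream operators are carried along unchanged.
        out: Partial[B] = Partial(values=[], errors=list(inp.errors))
        for rec in inp.values:
            try:
                out.values.append(f(rec))
            except Exception as exc:
                # Record the failure as data instead of stopping the whole job.
                out.errors.append((rec, f"{type(exc).__name__}: {exc}"))
        return out
    return op


def accept(result: Partial[A], max_error_fraction: float) -> bool:
    """Assumed acceptance rule: the share of failed records must not exceed the bound."""
    total = len(result.values) + len(result.errors)
    return total == 0 or len(result.errors) / total <= max_error_fraction


# Usage: parse raw strings, then double them; one record is malformed.
raw = Partial(values=["1", "2", "oops", "4"])
parsed = lift_map(int)(raw)
doubled = lift_map(lambda x: x * 2)(parsed)
```

In this sketch, `doubled.values` holds the three records that were processed successfully and `doubled.errors` holds the one record that failed to parse, so `accept(doubled, 0.25)` holds while `accept(doubled, 0.1)` does not. Because errors travel inside the return value, no side channel such as a log file or global error flag is needed to report them.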
