Generating example data for dataflow programs

While developing data-centric programs, users often run (portions of) their programs over real data to see how they behave and what the output looks like. Doing so makes it easier to formulate, understand, and compose programs correctly than examining program logic alone. For large input data sets, these experimental runs can be time-consuming and inefficient. Unfortunately, sampling the input data does not always work well: selective operations such as filter and join can produce empty results over sampled inputs, and unless certain indexes are present, there is no way to generate biased samples efficiently. Consequently, new methods are needed for generating example input data for data-centric programs. We focus on an important category of data-centric programs, dataflow programs, which are best illustrated by displaying the series of intermediate data tables that occur between each pair of operations. We introduce and study the problem of generating example intermediate data for dataflow programs in a manner that illustrates the semantics of the operators while keeping the example data small. We identify two major obstacles that impede naive approaches, namely (1) highly selective operators and (2) noninvertible operators, and offer techniques for dealing with these obstacles. Our techniques perform well on real dataflow programs used at Yahoo! for web analytics.
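The sampling obstacle described above can be made concrete with a small sketch. The Python below is not the paper's technique; it only illustrates why uniformly sampled inputs tend to yield empty results once a selective filter is followed by a join. The table names (`visits`, `pages`), the column names, the threshold, and the helper functions are all hypothetical and chosen purely for illustration.

```python
import random

def filter_rows(rows, predicate):
    """FILTER operator: keep only rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def join(left, right, key):
    """Equi-JOIN operator on `key`, implemented as a simple hash join."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def run_dataflow(visits, pages):
    """Hypothetical two-step dataflow: a selective FILTER, then a JOIN."""
    long_visits = filter_rows(visits, lambda v: v["seconds"] > 990)
    return join(long_visits, pages, "url")

# Synthetic inputs (assumed data, not taken from the paper).
pages = [{"url": f"/p{i}", "rank": i % 10} for i in range(1000)]
visits = [
    {"user": u, "url": f"/p{u % 1000}", "seconds": u % 1000}
    for u in range(10000)
]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Running on the full inputs gives a non-empty result.
full = run_dataflow(visits, pages)

# Running on small uniform samples of each input: fewer than 1% of visits
# pass the filter, and a surviving visit still needs its matching page to
# have been sampled too, so the result is very likely empty.
sampled = run_dataflow(rng.sample(visits, 20), rng.sample(pages, 20))

print("rows from full inputs:   ", len(full))
print("rows from sampled inputs:", len(sampled))
```

In this setup the full run produces 90 joined rows, while the sampled run can produce at most 20 and will usually produce none, so the sampled run shows the user nothing about what the filter or the join actually does.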
