ASSET queries: a declarative alternative to MapReduce

Modern data analysis must cope with massive data sets, which may be stored persistently in databases or flat files or generated continuously in real time. An associated set is a collection of data sets annotated by the values of a domain D. Each data set is populated from a data source according to a condition θ and its annotating value. An ASsociated SET (ASSET) query consists of repeated, successive, interrelated definitions of associated sets, arranged column-wise so that the query resembles a spreadsheet document. We present DataMingler, a GUI for expressing and managing ASSET queries, data sources, and aggregate functions, and the ASSET Query Engine (QE), which evaluates ASSET queries efficiently. We argue that ASSET queries: a) constitute a useful class of OLAP queries, b) are suitable for distributed processing settings, and c) extend the MapReduce paradigm in a declarative way.
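To make the definition of an associated set concrete, the following is a minimal Python sketch, not the DataMingler or ASSET QE implementation. The function name `associated_set`, the sample `sales` rows, and the choice of conditions are all hypothetical illustrations. It assumes that the domain D is a list of values, the data source is an in-memory list of rows, and θ is a predicate over a domain value and a row. The second definition reuses an aggregate computed over the first, which is one plausible reading of "successive, interrelated definitions".

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def associated_set(
    domain: List[Any],
    source: List[Row],
    theta: Callable[[Any, Row], bool],
) -> Dict[Any, List[Row]]:
    """For each value v of the domain, collect the source rows r where theta(v, r) holds."""
    return {v: [r for r in source if theta(v, r)] for v in domain}


# Hypothetical data source and domain D (illustrative values only).
sales = [
    {"cust": "a", "amount": 10},
    {"cust": "b", "amount": 5},
    {"cust": "a", "amount": 7},
]
customers = ["a", "b", "c"]

# First "column": the sales rows of each customer.
by_customer = associated_set(customers, sales, lambda v, r: r["cust"] == v)

# Aggregates over the first column, one value per domain element.
totals = {v: sum(r["amount"] for r in rows) for v, rows in by_customer.items()}
averages = {
    v: (totals[v] / len(rows) if rows else None)
    for v, rows in by_customer.items()
}

# Second "column", defined in terms of the first: the rows of each customer
# whose amount exceeds that customer's average.
above_average = associated_set(
    customers,
    sales,
    lambda v, r: r["cust"] == v
    and averages[v] is not None
    and r["amount"] > averages[v],
)
```

Each dictionary plays the role of one spreadsheet-like column keyed by the values of D. Because every data set depends only on its own domain value and the source rows, the per-value work could in principle be partitioned across workers, which is the intuition behind the comparison with MapReduce.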
