Paper information - ASSET queries: a declarative alternative to MapReduce

ASSET queries: a declarative alternative to MapReduce

Today's complex world requires state-of-the-art data analysis over truly massive data sets. These data sets can be stored persistently in databases or flat files, or can be generated continuously in real time. An associated set is a collection of data sets annotated by the values of a domain D: for each value of D, the corresponding data set is populated from a data source according to a condition θ that may refer to that value. An ASsociated SET (ASSET) query consists of repeated, successive, interrelated definitions of associated sets, arranged column by column so that the result resembles a spreadsheet. We present DataMingler, a GUI for expressing and managing ASSET queries, data sources, and aggregate functions, and the ASSET Query Engine (QE), which evaluates ASSET queries efficiently. We argue that ASSET queries (a) constitute a useful class of OLAP queries, (b) are suitable for distributed processing settings, and (c) extend the MapReduce paradigm in a declarative way.
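The definition of an associated set can be made concrete with a small sketch. The Python below is only an illustration of the idea as stated in the abstract, not the DataMingler or ASSET QE implementation: the helper `associated_set`, the sample `customers` and `sales` data, and the column names are all hypothetical. It builds one associated set per domain value using a condition θ, aggregates it, and then defines a second associated set whose condition depends on the first column's aggregate, mimicking the successive, interrelated columns of an ASSET query.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def associated_set(
    domain: Iterable[Any],
    source: Iterable[Row],
    theta: Callable[[Any, Row], bool],
) -> Dict[Any, List[Row]]:
    """For each value d in the domain, collect the source rows r with theta(d, r) true.

    Hypothetical helper: a naive nested-loop evaluation, chosen for clarity.
    """
    domain = list(domain)
    result: Dict[Any, List[Row]] = {d: [] for d in domain}
    for row in source:
        for d in domain:
            if theta(d, row):
                result[d].append(row)
    return result


# Hypothetical domain D (customer ids) and data source (sales records).
customers = [1, 2, 3]
sales = [
    {"cust": 1, "amount": 10.0},
    {"cust": 2, "amount": 5.0},
    {"cust": 1, "amount": 7.5},
]

# Column 1: all sales of each customer, then an aggregate over each set.
by_customer = associated_set(customers, sales, lambda d, r: r["cust"] == d)
avg_amount = {
    d: (sum(r["amount"] for r in rows) / len(rows)) if rows else None
    for d, rows in by_customer.items()
}

# Column 2: its condition refers to the aggregate of column 1,
# i.e. the sales of each customer that exceed that customer's average.
above_avg = associated_set(
    customers,
    sales,
    lambda d, r: r["cust"] == d and r["amount"] > avg_amount[d],
)
above_avg_count = {d: len(rows) for d, rows in above_avg.items()}

# Spreadsheet-like result: one row per domain value, one column per aggregate.
for d in customers:
    print(d, avg_amount[d], above_avg_count[d])
```

Each domain value yields one output row, and each associated set contributes one or more aggregate columns, which is what gives the result its spreadsheet-like shape. Because every domain value's sets can be computed independently, the domain could in principle be partitioned across machines, which is consistent with the abstract's claim about distributed settings.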

Damianos Chatziantoniou | Elias Tzortzakakis
