论文信息 - Shark: fast data analysis using coarse-grained distributed memory

Shark: fast data analysis using coarse-grained distributed memory

Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data. It scales to thousands of nodes in a fault-tolerant manner. Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop on large datasets.

[1] Timos K. Sellis,et al. Multiple-query optimization , 1988, TODS.

[2] Patrick Valduriez,et al. Principles of Distributed Database Systems , 1990 .

[3] Patrick Valduriez,et al. Principles of distributed database systems (2nd ed.) , 1999 .

[4] Arlo Faria,et al. MapReduce : Distributed Computing for Machine Learning , 2006 .

[5] Kunle Olukotun,et al. Map-Reduce for Machine Learning on Multicore , 2006, NIPS.

[6] Michael J. Franklin,et al. Shared query processing in data streaming systems , 2006 .

[7] Michael Stonebraker,et al. H-store: a high-performance, distributed main memory transaction processing system , 2008, Proc. VLDB Endow..

[8] Sanjay Ghemawat,et al. MapReduce: simplified data processing on large clusters , 2008, CACM.

[9] Ravi Kumar,et al. Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[10] Jingren Zhou,et al. SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[11] Michael Stonebraker,et al. A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[12] Joseph M. Hellerstein,et al. MAD Skills: New Analysis Practices for Big Data , 2009, Proc. VLDB Endow..

[13] Abraham Silberschatz,et al. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads , 2009, Proc. VLDB Endow..

[14] Parag Agrawal,et al. The case for RAMClouds: scalable high-performance storage entirely in DRAM , 2010, OPSR.

[15] Zheng Shao,et al. Hive - a petabyte scale data warehouse using Hadoop , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[16] Andrey Gubarev,et al. Dremel : Interactive Analysis of Web-Scale Datasets , 2011 .

[17] Michael D. Ernst,et al. HaLoop , 2010, Proc. VLDB Endow..

[18] Joseph M. Hellerstein,et al. MapReduce Online , 2010, NSDI.

[19] Liang Lin,et al. Tenzing a SQL implementation on the MapReduce framework , 2011, Proc. VLDB Endow..

[20] Zhiwei Xu,et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[21] Scott Shenker,et al. Disk-Locality in Datacenter Computing Considered Irrelevant , 2011, HotOS.

[22] Alin Deutsch,et al. ASTERIX: towards a scalable, semistructured data platform for evolving-world models , 2011, Distributed and Parallel Databases.

[23] Rares Vernica,et al. Hyracks: A flexible and extensible foundation for data-intensive computing , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[24] Randy H. Katz,et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.

[25] Abraham Silberschatz,et al. Efficient processing of data warehousing queries in a split execution environment , 2011, SIGMOD '11.

[26] P. Cochat,et al. Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[27] Michael J. Franklin,et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[28] Srikanth Kandula,et al. PACMan: Coordinated Memory Caching for Parallel Jobs , 2012, NSDI.