Towards Scalable One-Pass Analytics Using MapReduce

An integral part of many data-intensive applications is the need to collect and analyze enormous datasets efficiently. Concurrent with such application needs is the increasing adoption of MapReduce as a programming model for processing large datasets using a cluster of machines. Current MapReduce systems, however, require the data set to be loaded into the cluster before running analytical queries, and thereby incur high delays to start query processing. Furthermore, existing systems are geared towards batch processing. In this paper, we seek to answer a fundamental question: what architectural changes are necessary to bring the benefits of the MapReduce computation model to incremental, one-pass analytics, i.e., to support stream processing and online aggregation? To answer this question, we first conduct a detailed empirical performance study of current MapReduce implementations including Hadoop and MapReduce Online using a variety of workloads. By doing so, we identify several drawbacks of existing systems for one-pass analytics. Based on the insights from our study, we conclude by listing key design requirements and arguing for architectural changes of MapReduce systems to overcome their current limitations and fully embrace incremental one-pass analytics and showing promising preliminary results.

[1]  Elke A. Rundensteiner,et al.  Scalable stream join processing with expensive predicates: workload distribution and adaptation by time-slicing , 2009, EDBT '09.

[2]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[3]  Joseph M. Hellerstein,et al.  MapReduce Online , 2010, NSDI.

[4]  Rares Vernica,et al.  Hyracks: A flexible and extensible foundation for data-intensive computing , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[5]  Magdalena Balazinska,et al.  Skew-resistant parallel processing of feature-extracting scientific user-defined functions , 2010, SoCC '10.

[6]  William Thies,et al.  An empirical characterization of stream programs and its implications for language and compiler design , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[7]  Leonardo Neumeyer,et al.  S4: Distributed Stream Computing Platform , 2010, 2010 IEEE International Conference on Data Mining Workshops.

[8]  Kun-Lung Wu,et al.  From a stream of relational queries to distributed stream processing , 2010, Proc. VLDB Endow..

[9]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[10]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[11]  Philip S. Yu,et al.  SPADE: the system s declarative stream processing engine , 2008, SIGMOD Conference.

[12]  David J. DeWitt,et al.  Parallel database systems: the future of high performance database systems , 1992, CACM.

[13]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[14]  Kun-Lung Wu,et al.  DEDUCE: at the intersection of MapReduce and stream processing , 2010, EDBT '10.

[15]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[16]  David J. DeWitt,et al.  Tuple Routing Strategies for Distributed Eddies , 2003, VLDB.

[17]  Frederick Reiss,et al.  TelegraphCQ: continuous dataflow processing , 2003, SIGMOD '03.

[18]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[19]  Lenin Ravindranath,et al.  Nectar: Automatic Management of Data and Computation in Datacenters , 2010, OSDI.

[20]  Frederick Reiss,et al.  TelegraphCQ: Continuous Dataflow Processing for an Uncertain World , 2003, CIDR.

[21]  John Cieslewicz,et al.  SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions , 2009, Proc. VLDB Endow..

[22]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[23]  Ying Xing,et al.  The Design of the Borealis Stream Processing Engine , 2005, CIDR.

[24]  Jennifer Widom,et al.  The CQL continuous query language: semantic foundations and query execution , 2006, The VLDB Journal.

[25]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[26]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[27]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[28]  Leonard D. Shapiro,et al.  Join processing in database systems with large main memories , 1986, TODS.

[29]  Magdalena Balazinska,et al.  ParaTimer: a progress indicator for MapReduce DAGs , 2010, SIGMOD Conference.

[30]  Peter J. Haas,et al.  Ripple joins for online aggregation , 1999, SIGMOD '99.

[31]  Abraham Silberschatz,et al.  HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads , 2009, Proc. VLDB Endow..

[32]  William Thies,et al.  Teleport messaging for distributed stream programs , 2005, PPoPP.

[33]  Michael Isard,et al.  Distributed aggregation for data-parallel computing: interfaces and implementations , 2009, SOSP '09.

[34]  Chris Jermaine,et al.  A disk-based join with probabilistic guarantees , 2005, SIGMOD '05.

[35]  Robert Grimm,et al.  A Universal Calculus for Stream Processing Languages , 2010, ESOP.

[36]  David J. DeWitt,et al.  GAMMA - A High Performance Dataflow Database Machine , 1986, VLDB.