Only Aggressive Elephants are Fast Elephants

Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast. We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.
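
The idea of keeping a different clustered index on each block replica can be illustrated with a toy sketch. This is not HAIL's implementation: the function names (`build_replicas`, `range_query`), the in-memory row format, and the sample attributes `id`, `age`, and `zip` are all assumptions made for illustration. Each "replica" stores the same rows sorted on a different attribute, and a range query is routed to the replica sorted on the filtered attribute, so it reads only the qualifying rows instead of scanning the whole block.

```python
from bisect import bisect_left, bisect_right


def build_replicas(rows, index_attrs):
    """Build one toy 'replica' per attribute: the same rows, sorted on that attribute.

    Returns {attr: (sorted_keys, sorted_rows)}. Hypothetical helper, not a HAIL API.
    """
    replicas = {}
    for attr in index_attrs:
        ordered = sorted(rows, key=lambda r: r[attr])
        keys = [r[attr] for r in ordered]
        replicas[attr] = (keys, ordered)
    return replicas


def range_query(replicas, attr, low, high):
    """Return (matching_rows, rows_read) for low <= row[attr] <= high.

    If some replica is sorted on `attr`, binary search finds the qualifying
    slice and only that slice is read. Otherwise fall back to a full scan
    of an arbitrary replica.
    """
    if attr in replicas:
        keys, ordered = replicas[attr]
        start = bisect_left(keys, low)
        end = bisect_right(keys, high)
        hits = ordered[start:end]
        return hits, len(hits)
    # No replica is sorted on this attribute: scan every row of one replica.
    _, ordered = next(iter(replicas.values()))
    hits = [r for r in ordered if low <= r[attr] <= high]
    return hits, len(ordered)


# Toy block of 1000 rows with three attributes.
rows = [{"id": i, "age": (i * 7) % 50, "zip": (i * 13) % 100} for i in range(1000)]

# Replication factor three: one sort order per replica.
replicas = build_replicas(rows, ["id", "age", "zip"])
hits, rows_read = range_query(replicas, "age", 10, 12)
print(len(hits), rows_read)
```

In this sketch the query on `age` reads only the rows it returns, while a query on an attribute without a matching replica would still scan the full block. That contrast is the intuition behind creating the indexes during upload, when the replicas are being written anyway.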
