The Seven Deadly Sins of Cloud Computing Research

Research into distributed parallelism on "the cloud" has surged lately. As the research agenda and methodology in this area are being established, we observe a tendency towards certain common simplifications and shortcuts employed by researchers, which we provocatively term "sins". We believe that these sins, in some cases, are threats to the scientific integrity and practical applicability of the research presented. In this paper, we identify and discuss seven "deadly sins" (many of which we have ourselves committed!), present evidence illustrating that they pose real problems, and discuss ways for the community to avoid them in the future.
