Faster: A Low Overhead Framework for Massive Data Analysis

With the recent rapid growth in the amount of social data available on the Internet, several distributed big data processing frameworks have been proposed and implemented. Hadoop has been widely used to process all kinds of data, not only data from social media. Spark is gaining popularity because it offers a more flexible, object-functional programming interface and improves performance in many cases. However, not all data analysis algorithms perform well on Hadoop or Spark. For instance, graph algorithms tend to generate large numbers of messages between processing elements, which may result in poor performance even in Spark. We introduce Faster, a low-latency distributed processing framework designed to exploit data locality to reduce processing costs in such algorithms. It offers an API similar to Spark's, but with a slightly different execution model and new operators. Our results show that it can significantly outperform Spark on large graphs, running up to one order of magnitude faster when computing PageRank on a partial Google+ friendship graph with more than one billion edges.
