Towards the optimization of a parallel streaming engine for telco applications

Parallel and distributed computing is becoming essential for processing, in real time, the increasingly massive volumes of data collected by telecommunications companies. Existing computational paradigms such as MapReduce (and its popular open-source implementation, Hadoop) provide a scalable, fault-tolerant mechanism for large-scale batch computations. However, many applications in the telco ecosystem require an incremental streaming approach that processes data in real time and enables proactive care. Storm is a scalable, fault-tolerant framework for the analysis of real-time streaming data. In this paper, we motivate the use of real-time streaming analytics in the telco ecosystem. We perform an experimental investigation into the performance of Storm, focusing in particular on the impact of parameter configuration. This investigation reveals that choosing optimal parameters is highly non-trivial, which motivates the creation of a parameter configuration engine. As first steps towards this engine, we provide an in-depth analysis of the inner workings of Storm and a set of models describing data flow cost, central processing unit (CPU) cost, and system management cost.
