Improving the predictability of distributed stream processors

Next-generation real-time applications require big-data infrastructures that can process huge, continuous data volumes under complex computational constraints. Such applications raise new issues for current big-data processing infrastructures. First, most current infrastructures for big-data processing were designed for general-purpose applications and therefore set aside real-time performance, which is in some cases an implicit requirement. Second, there is a lack of clear computational models that current big-data frameworks could support. To narrow this gap, this article makes three contributions. First, it provides a set of improvements to a computational model called distributed stream processing in order to formalize it as a real-time infrastructure. Second, it proposes extensions to Storm, one of the most popular stream processors. These extensions are designed to give the application extra control over the resources it uses in order to improve its predictability. Lastly, the article presents empirical evidence on the performance that can be expected from this type of infrastructure.

Model combining stream processing technology and real-time. Extensions to the Storm processor. Performance evaluation of the extension on a cluster.
