Message-oriented Middleware for Scalable Data Analytics Architectures

The democratization of the Internet has allowed many more people to use online services and enjoy their benefits. Traffic to websites has grown tremendously in recent years, especially with the emergence of social networks. Mobile applications, televisions and other non-computer devices are also connected to the Internet and use it to provide services to end users: video on demand, music streaming and so on. These applications rely on powerful backend servers that handle the requests made by devices and provide statistics and metrics about application usage. These metrics can be generated by aggregating access logs (e.g. HTTP request logs), which are potentially extremely large. Big data tools and analytics, which provide a way to handle this huge number of records, then come in handy, as typical client-server architectures, with a single database storing all the data, reach their limits in terms of performance and capacity. Data duplication, combined with dedicated and specialized databases to store the data, is the key to efficient data handling. How to fill these databases in an elegant, efficient and scalable manner is the remaining question, and message-oriented middleware may be a viable answer. This project aims at exploring the capabilities of such middleware, identifying the benefits and drawbacks of using it, and presenting how it can be integrated into a real-world application that needs to aggregate events and logs on a large scale. Apache Kafka and RabbitMQ, two message-oriented middleware systems, are benchmarked and compared on both performance metrics and qualitative criteria. A fully working proof of concept (an existing industry product modified to use a message-oriented middleware and a specialized data warehouse system) is developed and presented, in order to draw conclusions on the usefulness of message-oriented middleware when designing scalable data analytics architectures.
