On Observability and Monitoring of Distributed Systems: An Industry Interview Study

The business success of companies depends heavily on the availability and performance of their client applications. Due to modern development paradigms such as DevOps and microservice architectural styles, applications are decoupled into services with complex interactions and dependencies. Although these paradigms enable individual development cycles with reduced delivery times, they pose several challenges for managing the services in distributed systems. One major challenge is observing and monitoring such distributed systems. This paper presents a qualitative study of the challenges and good practices in the field of observability and monitoring of distributed systems. In 28 semi-structured interviews with software professionals, we discovered increasing complexity and dynamics in that field. Observability in particular is becoming an essential prerequisite for ensuring stable services and the further development of client applications. However, the participants reported a discrepancy in awareness of the topic's importance, from both the management and the developer perspective. Besides technical challenges, we identified a strong need for an organizational concept that includes strategy, roles, and responsibilities. Our results support practitioners in developing and implementing systematic observability and monitoring for distributed systems.
