Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study

The complexity and dynamism of microservice systems pose unique challenges to a variety of software engineering tasks, such as fault analysis and debugging. Despite the prevalence and importance of microservices in industry, there is limited research on the fault analysis and debugging of microservice systems. To fill this gap, we conduct an industrial survey to learn about typical faults of microservice systems, current debugging practice, and the challenges developers face in practice. We then develop a medium-size benchmark microservice system (to our knowledge, the largest and most complex open-source microservice system) and replicate 22 industrial fault cases on it. Based on the benchmark system and the replicated fault cases, we conduct an empirical study to investigate the effectiveness of existing industrial debugging practices and whether they can be further improved by introducing state-of-the-art tracing and visualization techniques for distributed systems. The results show that current industrial practices of microservice debugging can be improved by employing proper tracing and visualization techniques and strategies. Our findings also suggest a strong need for more intelligent trace analysis and visualization, e.g., by combining trace visualization with improved fault localization, and by employing data-driven and learning-based recommendation for guided visual exploration and comparison of traces.
