Debugging OpenStack Problems Using a State Graph Approach

It is hard to operate and debug systems like OpenStack that integrate many independently developed modules with multiple levels of abstractions. A major challenge is to navigate through the complex dependencies and relationships of the states in different modules or subsystems, to ensure the correctness and consistency of these states. We present a system that captures the runtime states and events from the entire OpenStack-Ceph stack, and automatically organizes these data into a graph that we call system operation state graph (SOSG). With SOSG we can use intuitive graph traversal techniques to solve problems like reasoning about the state of a virtual machine. Also, using a graph-based anomaly detection, we can automatically discover hidden problems in OpenStack. We have a scalable implementation of SOSG, and evaluate the approach on a 125-node production OpenStack cluster, finding a number of interesting problems.

[1]  Estevam R. Hruschka,et al.  Toward an Architecture for Never-Ending Language Learning , 2010, AAAI.

[2]  Carlos Maltzahn,et al.  Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.

[3]  Paola Batistoni,et al.  International Conference , 2001 .

[4]  Reynold Xin,et al.  GraphX: Graph Processing in a Distributed Dataflow Framework , 2014, OSDI.

[5]  Diane J. Cook,et al.  Graph-based anomaly detection , 2003, KDD '03.

[6]  Philip S. Yu,et al.  PathSim , 2011, Proc. VLDB Endow..

[7]  Song Han,et al.  Highly Efficient Distance-Based Anomaly Detection through Univariate with PCA in Wireless Sensor Networks , 2011, 2011IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications.

[8]  Rodrigo Fonseca,et al.  Pivot tracing , 2018, USENIX ATC.

[9]  Dipankar Dasgupta,et al.  Anomaly detection in multidimensional data using negative selection algorithm , 2002, Proceedings of the 2002 Congress on Evolutionary Computation. CEC'02 (Cat. No.02TH8600).

[10]  Jeanna N. Matthews,et al.  Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles , 2009, SOSP'09 2009.

[11]  Bhavani M. Thuraisingham,et al.  Statistical technique for online anomaly detection using Spark over heterogeneous data from multi-source VMware performance data , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[12]  Randy H. Katz,et al.  X-Trace: A Pervasive Network Tracing Framework , 2007, NSDI.

[13]  Hong Wen,et al.  Bayesian Statistical Inference in Machine Learning Anomaly Detection , 2010, 2010 International Conference on Communications and Intelligence Information Security.

[14]  Michael I. Jordan,et al.  Detecting large-scale system problems by mining console logs , 2009, SOSP '09.

[15]  Gerhard Weikum,et al.  YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia: Extended Abstract , 2013, IJCAI.

[16]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[17]  Jim Webber,et al.  Graph Databases: New Opportunities for Connected Data , 2013 .

[18]  Danai Koutra,et al.  Graph based anomaly detection and description: a survey , 2014, Data Mining and Knowledge Discovery.

[19]  Vanish Talwar,et al.  Statistical techniques for online anomaly detection in data centers , 2011, 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops.

[20]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[21]  Fabian M. Suchanek,et al.  YAGO3: A Knowledge Base from Multilingual Wikipedias , 2015, CIDR.

[22]  Nguyen Lu Dang Khoa,et al.  Network Anomaly Detection Using a Commute Distance Based Approach , 2010, 2010 IEEE International Conference on Data Mining Workshops.

[23]  Laks V. S. Lakshmanan,et al.  Proceedings of the 2008 ACM SIGMOD international conference on Management of data , 2008, SIGMOD 2008.

[24]  Tom M. Mitchell,et al.  Learning to construct knowledge bases from the World Wide Web , 2000, Artif. Intell..

[25]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[26]  Gerhard Weikum,et al.  From information to knowledge: harvesting entities and relationships from web sources , 2010, PODS '10.

[27]  Lin Tan,et al.  Finding patterns in static analysis alerts: improving actionable alert ranking , 2014, MSR 2014.

[28]  Mona Attariyan,et al.  X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software , 2012, OSDI.

[29]  Gregory R. Ganger,et al.  Diagnosing Performance Changes by Comparing Request Flows , 2011, NSDI.

[30]  Wei Zhang,et al.  Knowledge vault: a web-scale approach to probabilistic knowledge fusion , 2014, KDD.

[31]  Vldb Endowment,et al.  The VLDB journal : the international journal on very large data bases. , 1992 .

[32]  Stefan Savage,et al.  California fault lines: understanding the causes and impact of network failures , 2010, SIGCOMM '10.

[33]  Hisashi Kashima,et al.  Eigenspace-based anomaly detection in computer systems , 2004, KDD.

[34]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[35]  Raymond T. Ng,et al.  Distance-based outliers: algorithms and applications , 2000, The VLDB Journal.

[36]  Carlos Guestrin,et al.  Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud , 2012 .

[37]  Ding Yuan,et al.  SherLog: error diagnosis by connecting clues from run-time logs , 2010, ASPLOS XV.

[38]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.