Automatically and Adaptively Identifying Severe Alerts for Online Service Systems

In large-scale online service systems, engineers collect various monitoring data and write many rules to trigger alerts in order to improve service quality. However, the number of alerts far exceeds what on-call engineers can properly investigate. In practice, alerts are therefore classified into several priority levels using manual rules, and on-call engineers focus primarily on the alerts with the highest priority level (i.e., severe alerts). Unfortunately, because online services are complex and dynamic, this rule-based approach either misses severe alerts or wastes troubleshooting time on non-severe ones. In this paper, we propose AlertRank, an automatic and adaptive framework for identifying severe alerts. Specifically, AlertRank extracts a set of powerful and interpretable features (textual and temporal alert features, and univariate and multivariate anomaly features for monitoring metrics), adopts the XGBoost ranking algorithm to identify the severe alerts among all incoming alerts, and uses novel methods to obtain labels for both training and testing. Experiments on datasets from a top global commercial bank demonstrate that AlertRank is effective, achieving an average F1-score of 0.89 and outperforming all baselines. Feedback from practice shows that AlertRank can significantly reduce manual effort for on-call engineers.
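The general idea of scoring alerts from interpretable features, ranking them, and evaluating the result with an F1-score can be illustrated with a minimal Python sketch. It is not AlertRank's implementation: a hand-weighted linear score stands in for the learned XGBoost ranking model, and the `Alert` fields, `KEYWORDS`, `WEIGHTS`, `THRESHOLD`, and the sample alerts are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    text: str          # alert message (source of textual features)
    recent_count: int  # same-type alerts in the last hour (temporal feature)
    anomaly: float     # anomaly score of the related metric, in [0, 1]
    severe: bool       # ground-truth label, used only for evaluation


# Hypothetical keywords and weights; a learned ranker would replace these.
KEYWORDS = ("timeout", "down", "failed")
WEIGHTS = {"keyword": 0.5, "burst": 0.2, "anomaly": 0.3}


def features(alert: Alert) -> dict:
    """Turn one alert into a small, interpretable feature vector."""
    text = alert.text.lower()
    return {
        "keyword": float(any(k in text for k in KEYWORDS)),
        "burst": min(alert.recent_count / 10.0, 1.0),
        "anomaly": alert.anomaly,
    }


def score(alert: Alert) -> float:
    """Weighted sum of features; a stand-in for the learned ranking model."""
    f = features(alert)
    return sum(WEIGHTS[name] * value for name, value in f.items())


def f1_score(predicted: list, actual: list) -> float:
    """F1 = 2PR / (P + R), computed from boolean predictions and labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up sample alerts, for illustration only.
alerts = [
    Alert("Database connection timeout", 8, 0.9, True),
    Alert("Disk usage above 70%", 1, 0.2, False),
    Alert("Payment service down", 5, 0.8, True),
    Alert("CPU usage above 60%", 12, 0.1, False),
    Alert("Login latency increased", 2, 0.7, True),
]

THRESHOLD = 0.5  # assumed cut-off for calling an alert severe
ranked = sorted(alerts, key=score, reverse=True)
predicted = [score(a) >= THRESHOLD for a in alerts]
actual = [a.severe for a in alerts]
print([a.text for a in ranked])
print(round(f1_score(predicted, actual), 2))
```

On this toy data the sketch flags two of the three severe alerts and raises no false alarms, so precision is 1, recall is 2/3, and the F1-score is 0.8. The missed alert contains none of the fixed keywords, which is the kind of gap that a model trained on labeled alerts is meant to close.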
