Juggling the Jigsaw: Towards Automated Problem Inference from Network Trouble Tickets

This paper presents NetSieve, a system that aims to do automated problem inference from network trouble tickets. Network trouble tickets are diaries comprising fixed fields and free-form text written by operators to document the steps while troubleshooting a problem. Unfortunately, while tickets carry valuable information for network management, analyzing them to do problem inference is extremely difficult--fixed fields are often inaccurate or incomplete, and the free-form text is mostly written in natural language. This paper takes a practical step towards automatically analyzing natural language text in network tickets to infer the problem symptoms, troubleshooting activities and resolution actions. Our system, NetSieve, combines statistical natural language processing (NLP), knowledge representation, and ontology modeling to achieve these goals. To cope with ambiguity in free-form text, NetSieve leverages learning from human guidance to improve its inference accuracy. We evaluate NetSieve on 10K+ tickets from a large cloud provider, and compare its accuracy using (a) an expert review, (b) a study with operators, and (c) vendor data that tracks device replacement and repairs. Our results show that NetSieve achieves 89%-100% accuracy and its inference output is useful to learn global problem trends. We have used NetSieve in several key network operations: analyzing device failure trends, understanding why network redundancy fails, and identifying device problem symptoms.

[1]  Armando Fox,et al.  Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.

[2]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL.

[3]  Alfred V. Aho,et al.  Efficient string matching , 1975, Commun. ACM.

[4]  Ranveer Chandra,et al.  What's going on?: learning communication rules in edge networks , 2008, SIGCOMM '08.

[5]  Saurabh Bagchi,et al.  Automated Rule-Based Diagnosis Through a Distributed Monitor System , 2007, IEEE Transactions on Dependable and Secure Computing.

[6]  Robert Richards,et al.  Representational State Transfer (REST) , 2006 .

[7]  Westley Weimer,et al.  Modeling bug report quality , 2007, ASE '07.

[8]  Dale S. Johnson,et al.  NOC Internal Integrated Trouble Ticket System Functional Specification Wishlist ("NOC TT REQUIREMENTS") , 1992, RFC.

[9]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[10]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[11]  Cristina Melchiors,et al.  Troubleshooting network faults using past experience , 2000, NOMS 2000. 2000 IEEE/IFIP Network Operations and Management Symposium 'The Networked Planet: Management Beyond 2000' (Cat. No.00CB37074).

[12]  Inderjeet Mani,et al.  The Tipster Summac Text Summarization Evaluation , 1999, EACL.

[13]  Daniel S. Weld,et al.  Open Information Extraction Using Wikipedia , 2010, ACL.

[14]  Matthew Roughan,et al.  IP forwarding anomalies and improving their detection using multiple data sources , 2004, NetT '04.

[15]  N. F. Noy,et al.  Ontology Development 101: A Guide to Creating Your First Ontology , 2001 .

[16]  Navendu Jain,et al.  Understanding network failures in data centers: measurement, analysis, and implications , 2011, SIGCOMM.

[17]  Kenji Yamanishi,et al.  Dynamic syslog mining for network failure monitoring , 2005, KDD '05.

[18]  Kenneth Ward Church,et al.  Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus , 2001, Computational Linguistics.

[19]  Jianfeng Gao,et al.  Extraction of Chinese Compound Words - An Experimental Study on a Very Large Corpus , 2000, ACL 2000.

[20]  Frank Smadja,et al.  Retrieving Collocations from Text: Xtract , 1993, CL.

[21]  Jade Goldstein-Stewart,et al.  Summarizing text documents: sentence selection and evaluation metrics , 1999, SIGIR '99.

[22]  M. Benson,et al.  Collocations and General-purpose Dictionaries , 1990 .

[23]  Ding Yuan,et al.  SherLog: error diagnosis by connecting clues from run-time logs , 2010, ASPLOS XV.

[24]  Long H. Ngo,et al.  Implementation and Evaluation of Four Different Methods of Negation Detection , 2007 .

[25]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[26]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[27]  Esko Ukkonen,et al.  On-line construction of suffix trees , 1995, Algorithmica.

[28]  Stefan Savage,et al.  California fault lines: understanding the causes and impact of network failures , 2010, SIGCOMM '10.

[29]  Aniket Kittur,et al.  Crowdsourcing user studies with Mechanical Turk , 2008, CHI.

[30]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[31]  Makoto Nagao,et al.  A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese , 1994, COLING.

[32]  Navendu Jain,et al.  Understanding network failures in data centers , 2011, SIGCOMM 2011.

[33]  Thomas Zimmermann,et al.  Extracting structural information from bug reports , 2008, MSR '08.

[34]  Thomas R. Gruber,et al.  Toward principles for the design of ontologies used for knowledge sharing? , 1995, Int. J. Hum. Comput. Stud..

[35]  Kenneth Heafield,et al.  KenLM: Faster and Smaller Language Model Queries , 2011, WMT@EMNLP.

[36]  Slava M. Katz,et al.  Technical terminology: some linguistic properties and an algorithm for identification in text , 1995, Natural Language Engineering.

[37]  Dan Pei,et al.  What happened in my network: mining network events from router syslogs , 2010, IMC '10.

[38]  Kavé Salamatian,et al.  Anomaly extraction in backbone networks using association rules , 2009, IMC '09.

[39]  Christopher D. Manning,et al.  Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger , 2000, EMNLP.

[40]  Nick Feamster,et al.  Diagnosing network disruptions with network-wide analysis , 2007, SIGMETRICS '07.

[41]  Renata Teixeira,et al.  TroubleMiner: Mining network trouble tickets , 2009, 2009 IFIP/IEEE International Symposium on Integrated Network Management-Workshops.

[42]  Michael I. Jordan,et al.  Detecting large-scale system problems by mining console logs , 2009, SOSP '09.

[43]  Suresh Venkatasubramanian,et al.  Streaming for large scale NLP: Language Modeling , 2009, NAACL.

[44]  Navjot Singh,et al.  A log mining approach to failure analysis of enterprise telephony systems , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[45]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.

[46]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL 2006.

[47]  Chris Buckley,et al.  Automatic Text Summarization by Paragraph Extraction , 1997 .

[48]  Thomas Zimmermann,et al.  Duplicate bug reports considered harmful … really? , 2008, 2008 IEEE International Conference on Software Maintenance.

[49]  Farnam Jahanian,et al.  Experimental study of Internet stability and backbone failures , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[50]  Per Runeson,et al.  Detection of Duplicate Defect Reports Using Natural Language Processing , 2007, 29th International Conference on Software Engineering (ICSE'07).

[51]  Van-Anh Truong,et al.  Availability in Globally Distributed Storage Systems , 2010, OSDI.

[52]  Terry A. Welch,et al.  A Technique for High-Performance Data Compression , 1984, Computer.

[53]  Thomas Zimmermann,et al.  Towards the next generation of bug tracking systems , 2008, 2008 IEEE Symposium on Visual Languages and Human-Centric Computing.

[54]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[55]  Roberto Manione,et al.  An Expert System for Real Time Fault Diagnosis of the Italian Telecommunications Network , 1993, Integrated Network Management.