Searching similar segments over textual event sequences

Sequential data is prevalent in many scientific and commercial applications, such as bioinformatics, system security, and networking. Similarity search has been widely studied for symbolic and time series data, in which each data object is a symbol or a numeric value. Textual event sequences are sequences of events in which each object is a message describing an event. For example, system logs are typical textual event sequences: each event is a textual message recording internal system operations, statuses, configuration modifications, or execution errors. Similar segments of an event sequence reveal similar system behaviors in the past, which help system administrators diagnose system problems. Existing search indexing for textual data focuses only on unordered data. Substring matching methods can efficiently find matching segments over a sequence, but they assume that each element of the sequence is a single value rather than a text. In this paper, we propose a method, the suffix matrix, for efficiently searching similar segments over textual event sequences. It integrates two disparate techniques: locality-sensitive hashing and suffix arrays. The method also supports k-dissimilar segment search, where a k-dissimilar segment is a segment that differs from the query sequence in at most k events. By using the random sequence mask proposed in this paper, the method reaches all k-dissimilar segments with high probability without substantially increasing the search cost. We conduct experiments on real system log data, and the results show that the proposed method outperforms alternative methods built on existing techniques.
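
To make the notion of a k-dissimilar segment concrete, the following is a minimal brute-force sketch of the search problem, not the suffix matrix itself. It assumes that two events are "similar" when the Jaccard similarity of their whitespace-separated tokens reaches a threshold, and that the query is compared position by position against each window of the sequence. The function names, the threshold of 0.5, and the sample log messages are all hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple


def tokens(message: str) -> Set[str]:
    """Split an event message into a set of lowercase tokens (assumed tokenizer)."""
    return set(message.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def k_dissimilar_segments(
    sequence: List[str], query: List[str], k: int, threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (start, mismatches) for every window of len(query) events in
    `sequence` that has at most k events dissimilar to the aligned query event.
    This scans every window, so it costs O(len(sequence) * len(query))."""
    q = [tokens(m) for m in query]
    s = [tokens(m) for m in sequence]
    hits = []
    for start in range(len(s) - len(q) + 1):
        mismatches = 0
        for offset, q_tokens in enumerate(q):
            if jaccard(s[start + offset], q_tokens) < threshold:
                mismatches += 1
                if mismatches > k:
                    break  # window already exceeds the budget of k dissimilar events
        if mismatches <= k:
            hits.append((start, mismatches))
    return hits


# Hypothetical log excerpt and query, for illustration only.
log = [
    "disk usage 91 percent on /var",
    "service httpd restarted",
    "connection refused from 10.0.0.5",
    "disk usage 95 percent on /var",
    "kernel panic detected",
    "connection refused from 10.0.0.9",
]
query = [
    "disk usage 93 percent on /var",
    "service httpd restarted",
    "connection refused from 10.0.0.7",
]

print(k_dissimilar_segments(log, query, k=1))  # [(0, 0), (3, 1)]
```

With `k=0` only the window starting at position 0 is returned; allowing one dissimilar event also admits the window at position 3, whose middle event does not resemble the query. An index such as the one proposed in the paper aims to avoid this exhaustive scan.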
