Size Matters: Finding the Most Informative Set of Window Lengths

Event sequences often vary continuously at several levels at once: their properties and characteristics change at different rates, concurrently. For example, the sales of a product may slowly become more frequent over a period of several weeks, while also showing interesting variation within each week. To provide an accurate and robust "view" of such multi-level structural behavior, one needs to determine the appropriate levels of granularity for analyzing the underlying sequence. We introduce the novel problem of finding the best set of window lengths for analyzing discrete event sequences. We define suitable criteria for choosing window lengths and propose an efficient method to solve the problem. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on synthetic data and on real data from two domains: text and DNA. We find that the optimal sets of window lengths can themselves provide new insight into the data; for example, the burstiness of events affects the optimal window lengths for measuring event frequencies.
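
To make concrete what measuring event frequencies at a given window length means, the following is a minimal sketch in Python. It is not the selection method proposed in the paper: it only computes sliding-window relative frequencies of a binary event sequence for a few candidate window lengths. The function names, the toy sequence, and the candidate lengths 3 and 6 are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def windowed_frequencies(events: Sequence[int], window: int) -> List[float]:
    """Relative event frequency in every sliding window of the given length.

    `events` is a 0/1 sequence (1 = the event occurs at that position).
    """
    if window <= 0 or window > len(events):
        raise ValueError("window must be between 1 and len(events)")
    # Count events in the first window, then slide one position at a time,
    # adding the entering element and dropping the leaving one.
    count = sum(events[:window])
    freqs = [count / window]
    for i in range(window, len(events)):
        count += events[i] - events[i - window]
        freqs.append(count / window)
    return freqs


def frequencies_by_window(
    events: Sequence[int], windows: Sequence[int]
) -> Dict[int, List[float]]:
    """Frequency profile of the same sequence at several window lengths."""
    return {w: windowed_frequencies(events, w) for w in windows}


# Hypothetical toy sequence: a burst of events followed by a lull.
events = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
profiles = frequencies_by_window(events, [3, 6])
```

On this toy input the short window follows the burst closely, while the longer window smooths it out. This is the kind of difference between granularities that motivates choosing a set of window lengths rather than a single one.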
