今日推荐

2012 - KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few millions of time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact; in large datasets we can exactly search under DTW much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We show that our ideas allow us to solve higher-level time series data mining problem such as motif discovery and clustering at scales that would otherwise be untenable. In addition to mining massive datasets, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.

1997 - KDD

A Probabilistic Approach to Fast Pattern Matching in Time Series Databases

The problem of efficiently and accurately locating patterns of interest in massive time series data sets is an important and non-trivial problem in a wide variety of applications, including diagnosis and monitoring of complex systems, biomedical data analysis, and exploratory data analysis in scientific and business time series. In this paper a probabilistic approach is taken to this problem. Using piecewise linear segmentations as the underlying representation, local features (such as peaks, troughs, and plateaus) are defined using a prior distribution on expected deformations from a basic template. Global shape information is represented using another prior on the relative locations of the individual features. An appropriately defined probabilistic model integrates the local and global information and directly leads to an overall distance measure between sequence patterns based on prior knowledge. A search algorithm using this distance measure is shown to efficiently and accurately find matches for a variety of patterns on a number of data sets, including engineering sensor data from space Shuttle mission archives. The proposed approach provides a natural framework to support user-customizable "query by content" on time series data, taking prior domain information into account in a principled manner.

2000 - VLDB

Identifying Representative Trends in Massive Time Series Data Sets Using Sketches

Many data stores, including scientific and financial databases, business warehouses and network repositories, contain time series data. Time series data depict trends for an observed value e.g., value of a stock, number of bytes sent on a router interface, etc., as a function of time. Analysis of the trends over different time windows is of great interest. In this paper, we formalize problems of identifying various 'representative' trends in time series data. Informally, an interval of observations in a time series is defined to be a representative trend if its distance from other intervals satisfy certain properties, for suitably defined distance functions between time series intervals. Natural trends of interest such as periodic or average trends are examples of representative trends. We present efficient algorithms for analyzing massive time series data sets for representative trends over arbitrary windows of interest. Our algorithms are highly processor and 10 efficient; they are approximate but provide probabilistic guarantees for the approximations achieved. Our approach for identifying representative trends relies on a dimensionality reduction technique that replaces each interval by a 'sketch' which is a low dimensional vector. We present efficient algorithms to construct such sketches using a pool of select sketches that we precompute using polynomial convolutions. Using such sketches, we can compute representative trends accurately. Finally, we present results of a detailed experimental study of our technique on very large real data sets. Our results show that, compared to approaches that determine representative trends exactly, our approach shows significant performance gains with only a small loss in accuracy.

2005 - Information Visualization

Visualizing and Discovering Non-Trivial Patterns in Large Time Series Databases

Data visualization techniques are very important for data analysis, since the human eye has been frequently advocated as the ultimate data-mining tool. However, there has been surprisingly little work on visualizing massive time series data sets. To this end, we developed VizTree, a time series pattern discovery and visualization system based on augmenting suffix trees. VizTree visually summarizes both the global and local structures of time series data at the same time. In addition, it provides novel interactive solutions to many pattern discovery problems, including the discovery of frequently occurring patterns (motif discovery), surprising patterns (anomaly detection), and query by content. VizTree works by transforming the time series into a symbolic representation, and encoding the data in a modified suffix tree in which the frequency and other properties of patterns are mapped onto colors and other visual properties. We demonstrate the utility of our system by comparing it with state-of-the-art batch algorithms on several real and synthetic data sets. Based on the tree structure, we further device a coefficient which measures the dissimilarity between any two time series. This coefficient is shown to be competitive with the well-known Euclidean distance.

论文关键词

neural network sensor network machine learning artificial neural network support vector machine deep learning time series data mining support vector vector machine wavelet transform data analysi deep neural network neural network model hidden markov model regression model deep neural anomaly detection gene expression data base generative adversarial network generative adversarial time series datum adversarial network experimental datum fourier series nearest neighbor support vector regression time series analysi missing datum data based moving average gene expression datum time series model series analysi lyapunov exponent series datum outlier detection dynamic time warping time series forecasting data mining algorithm panel datum time series prediction series model multivariate time series finite time unit root dynamic time linear and nonlinear series forecasting time warping distance measure financial time series series prediction integrated moving average experimental comparison multivariate time financial time dependent variable chaotic time series nonlinear time vegetation index nonlinear time series arima model fuzzy time large time anomaly detection method fuzzy time series chaotic time autoregressive integrated moving time series based air pollutant time series classification representation method fokker-planck equation series representation similarity analysi series classification univariate time series time series clustering unsupervised anomaly detection periodic pattern nearest neighbor classification time series dataset series data mining time series regression anomaly detection approach time series database series clustering observed time series forecasting time series local similarity long time series time series similarity series database fmri time series complex time indian stock market time series representation symbolic aggregate approximation complex time series forecasting time series data set series similarity fmri time time series anomaly large time series series data analysi series anomaly detection analyzing time series expression time series interrupted time series ucr time series time correction modeling time series clustering time series mining time series interrupted time series data based fourier series representation simple exponential smoothing early classification forecast time series time series subsequence sensor networks pose distributed index piecewise constant approximation quality time series mining time microarray time series incomplete time series massive time series large-scale time series analysing time series microarray time neural time series mri time neural time series data generated time series experiment visualizing time series called time series data set