Data mining approaches are widely used to analyse data in many domains. Among them, clustering is one of the most widely used and is typically adopted to group data by similarity. In many systems, such as those in finance, healthcare, and business, data are stored as time-series. Clustering such complex data can reveal patterns that carry valuable information. Time-series clustering is useful not only as an exploratory technique but also as a subroutine in more complex data mining algorithms. As a result, time-series clustering, as a part of temporal data mining research, has attracted increasing interest in areas such as medicine, biology, finance, economics, and the Web.
Several studies on time-series clustering have been conducted in these areas. Many of them focus on the time complexity of clustering large time-series datasets and address it by combining dimensionality reduction with conventional clustering algorithms. However, conventional clustering algorithms are not practical for time-series data because they are essentially designed for static data, which leads to poor clustering accuracy. Adequate clustering approaches for time-series are therefore lacking.
This thesis addresses the low clustering quality of existing work and proposes a new multi-step clustering model. The model facilitates accurate clustering of time-series datasets and is designed specifically for very large ones. It overcomes the limitations of conventional clustering algorithms in dealing with time-series data.
In the first step of the model, the data are pre-processed, represented using symbolic aggregate approximation (SAX), and grouped approximately by a novel approach. In the second step, the groups are refined using an accurate clustering method, and a representative is defined for each cluster. Finally, the representatives are merged to construct the final clusters. The model is then extended into an interactive model in which the results obtained by the user become more accurate over time. In this work, accurate clustering is performed on the basis of shape similarity. It is shown that clustering time-series does not require calculating the exact distance or similarity between all pairs of time-series in a dataset; instead, accurate clusters can be obtained by using prototypes of similar time-series.
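As an illustration of the first step only, the following is a minimal sketch of symbolic aggregate approximation followed by a coarse grouping. It assumes the standard formulation of the representation (z-normalisation, piecewise aggregate approximation, and discretisation with equiprobable Gaussian breakpoints). Grouping series that share an identical symbolic word is a simple stand-in chosen for illustration; it is not the novel approximate grouping approach proposed in this thesis. The function names, the word length of 4, and the alphabet "abcd" are all hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def z_normalize(series):
    """Pre-processing: rescale a series to zero mean and unit variance."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series carries no shape information.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of each segment."""
    n = len(series)
    if not 0 < n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    return [
        mean(series[i * n // n_segments:(i + 1) * n // n_segments])
        for i in range(n_segments)
    ]


def sax(series, n_segments=4, alphabet="abcd"):
    """Map a series to a symbolic word (assumed standard formulation)."""
    # Breakpoints split the standard normal distribution into
    # len(alphabet) regions of equal probability.
    breakpoints = [
        NormalDist().inv_cdf(k / len(alphabet))
        for k in range(1, len(alphabet))
    ]
    reduced = paa(z_normalize(series), n_segments)
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in reduced)


def group_by_word(dataset, n_segments=4, alphabet="abcd"):
    """Illustrative stand-in for approximate grouping: series with the
    same symbolic word fall into the same group (NOT the thesis's method)."""
    groups = {}
    for index, series in enumerate(dataset):
        word = sax(series, n_segments, alphabet)
        groups.setdefault(word, []).append(index)
    return groups


dataset = [
    [1, 2, 3, 4, 5, 6, 7, 8],          # rising
    [10, 20, 30, 40, 50, 60, 70, 80],  # rising, different scale
    [8, 7, 6, 5, 4, 3, 2, 1],          # falling
]
groups = group_by_word(dataset)
```

In this sketch the two rising series receive the same word despite their different scales, because z-normalisation removes offset and amplitude before discretisation. Each resulting group could then be handed to a more accurate, shape-based method for refinement, as described for the second step.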
To evaluate its accuracy, the proposed model is tested extensively on published time-series datasets from diverse domains. The model is shown to be more accurate than existing work and is also scalable to large datasets because it uses multi-resolution representations of the time-series at different levels of clustering. Moreover, its ability to generate hierarchical and arbitrarily shaped clusters of time-series data provides a clear understanding of the domains.