Partitional clustering of tick data to reduce storage space

Tick data is one of the most prominent types of temporal data, as it can be used to represent data in various domains such as geophysics or finance. Storing tick data is challenging because two criteria have to be fulfilled simultaneously: the storage structure should allow fast execution of queries, and the data should not occupy too much space on the hard disk or in main memory. In this paper, we present a clustering-based solution and introduce a new clustering algorithm, SOPAC, that is designed to support the storage of tick data. Our approach is based on the search for a partitional clustering that optimizes storage space. We evaluate our algorithm both on publicly available real-world datasets and on real-world tick data from the financial domain. We also investigate, on task-specific benchmarks, how well our approach estimates the optimum. Our experiments show that, for the tick data storage problem, our algorithm substantially outperforms state-of-the-art clustering algorithms, both in terms of statistical significance and practical relevance.
