Storage-optimizing clustering algorithms for high-dimensional tick data

Tick data are used in applications that track values changing over time, such as stock market prices or meteorological measurements. Because these values can change very frequently, tick data tend to grow rapidly. It is therefore essential to reduce the storage space tick data require while still allowing queries to be executed efficiently. In this paper, we propose an approach that decomposes the original tick data matrix by clustering its attributes using a new clustering algorithm called Storage-Optimizing Hierarchical Agglomerative Clustering (SOHAC). We additionally propose a method for speeding up SOHAC, based on a new lower bounding technique, that allows SOHAC to be applied to high-dimensional tick data. Our experimental evaluation shows that the proposed approach compares favorably to several baselines in terms of compression. Additionally, it can lead to a significant speedup in running time.
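The idea of decomposing a tick data matrix by grouping its attributes can be illustrated with a minimal sketch. This is not SOHAC as published: the cost model (one stored row per tick at which any attribute of a group changes, each row holding a timestamp plus the group's values), the greedy merge rule, and the names `group_cost` and `greedy_agglomerative_grouping` are all assumptions made for illustration.

```python
from itertools import combinations


def group_cost(matrix, group):
    """Cells needed to store one attribute group as a change log.

    Assumed cost model: a row is written whenever any attribute in the
    group changes (the first tick is always stored), and each row holds
    one timestamp plus one value per attribute in the group.
    """
    rows = 1  # the initial tick is always stored
    for prev, cur in zip(matrix, matrix[1:]):
        if any(prev[j] != cur[j] for j in group):
            rows += 1
    return rows * (1 + len(group))


def greedy_agglomerative_grouping(matrix):
    """Merge attribute groups bottom-up while a merge reduces total cost."""
    groups = [(j,) for j in range(len(matrix[0]))]
    while len(groups) > 1:
        best_gain, best_pair = 0, None
        for a, b in combinations(groups, 2):
            gain = (group_cost(matrix, a) + group_cost(matrix, b)
                    - group_cost(matrix, a + b))
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge saves storage, so stop
            break
        a, b = best_pair
        merged = tuple(sorted(a + b))
        groups = [g for g in groups if g not in (a, b)] + [merged]
    return groups


# Hypothetical ticks: columns 0 and 1 always change together,
# column 2 changes at different ticks.
ticks = [
    [10, 11, 5],
    [10, 11, 6],
    [12, 13, 6],
    [12, 13, 7],
    [14, 15, 7],
    [14, 15, 8],
]
print(sorted(greedy_agglomerative_grouping(ticks)))
```

Under this assumed cost model, storing columns 0 and 1 together and column 2 separately takes 17 cells, compared with 20 cells for three single-column logs and 24 cells for the full matrix with a timestamp column.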
