OSTSC: Over Sampling for Time Series Classification in R

The OSTSC package implements an oversampling approach for classifying univariate, multinomial (multi-class) time series data in R. This vignette gives a brief overview of the oversampling methodology implemented by the package, followed by a tutorial. We begin with three test cases that let the user quickly validate the package's functionality. To demonstrate the performance impact of OSTSC, we then provide two medium-sized imbalanced time series datasets. Each example applies a TensorFlow implementation of a Long Short-Term Memory (LSTM) classifier, a type of Recurrent Neural Network (RNN) classifier, to imbalanced time series, and compares classifier performance with and without oversampling. Finally, larger versions of these two datasets are evaluated to demonstrate the scalability of the package. The examples demonstrate that the OSTSC package improves the performance of RNN classifiers applied to highly imbalanced time series data. In particular, OSTSC is observed to increase the AUC of the LSTM from 0.543 to 0.784 on a high-frequency trading dataset consisting of 30,000 time series observations.
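To make the idea of balancing an imbalanced set of labelled time series concrete, the following is a minimal sketch in Python. It assumes plain random oversampling with replacement, which is not the method implemented by OSTSC; the function name `oversample_minority` and the toy data are hypothetical and serve only as illustration.

```python
import random
from collections import Counter


def oversample_minority(series, labels, seed=0):
    """Duplicate randomly chosen series from each smaller class until
    every class has as many series as the largest class.

    NOTE: this is naive random oversampling, used here only to illustrate
    class balancing; it is an assumption, not the OSTSC algorithm.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_series, out_labels = list(series), list(labels)
    for cls, n in counts.items():
        # All original series belonging to this class.
        pool = [s for s, y in zip(series, labels) if y == cls]
        for _ in range(target - n):
            out_series.append(list(rng.choice(pool)))  # copy, do not alias
            out_labels.append(cls)
    return out_series, out_labels


# Toy data: 8 majority series (class 0) and 2 minority series (class 1),
# each of length 4.
series = [[float(i + t) for t in range(4)] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_series, balanced_labels = oversample_minority(series, labels)
class_counts = Counter(balanced_labels)
```

After the call, both classes contain the same number of series, so a classifier trained on `balanced_series` no longer sees a skewed class prior. Exact duplicates add no new information, however, which is one reason more elaborate oversampling schemes exist.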
