Mining lake time series using symbolic representation

Sensor networks deployed in lakes and reservoirs, combined with simulation models and expert knowledge from the global community, are deepening our understanding of the ecological dynamics of lakes. However, the volume of data and the complexity of its patterns demand substantial computing resources and efficient data mining algorithms, both of which lie outside the realm of traditional limnological research. This paper adapts methods from computer science to data-intensive ecological questions, in order to give ecologists an approachable methodology for knowledge discovery in lake ecology. We apply a state-of-the-art time series mining technique based on symbolic representation, Symbolic Aggregate approXimation (SAX), to high-frequency time series of phycocyanin (PHYCO) and chlorophyll (CHLORO) fluorescence, both of which are indicators of algal biomass in lakes, as well as to model predictions of algal biomass (MODEL). Using these data mining techniques, we demonstrate that MODEL predicts PHYCO better than it predicts CHLORO. All three time series are highly redundant, so each reduces to a relatively small subset of unique patterns. However, MODEL is much less complex than either PHYCO or CHLORO and fails to reproduce the high-biomass periods indicative of algal blooms. We develop a set of tools in R that enable motif discovery and anomaly detection within a single lake time series, and the study of relationships among multiple lake time series through distance metrics, clustering, and classification. Furthermore, to reduce computation times, we provision web services that launch the R tools remotely on high-performance computing (HPC) resources. Comprehensive experimental results on observational and simulated lake data demonstrate the effectiveness of our approach.
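
To make the symbolic representation concrete, the following is a minimal Python sketch of the general SAX idea: z-normalize a series, average it over equal-width frames (piecewise aggregate approximation), and map each frame mean to a letter using breakpoints that split the standard normal distribution into equiprobable regions. It is not the R tooling described above. The function name `sax`, the four-letter alphabet, and the number of segments are illustrative assumptions, and the input values are made up rather than taken from lake data.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series to a SAX-style word (illustrative sketch).

    `n_segments` and `alphabet` are hypothetical parameters chosen for
    this example, not values taken from the paper.
    """
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")

    # Step 1: z-normalize so that series with different offsets and
    # amplitudes become comparable. A constant series maps to all zeros.
    mu, sigma = mean(series), pstdev(series)
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]

    # Step 2: piecewise aggregate approximation (PAA): replace each of
    # n_segments roughly equal-width frames by its mean.
    paa = []
    for i in range(n_segments):
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        paa.append(mean(z[start:end]))

    # Step 3: breakpoints that cut the standard normal distribution into
    # len(alphabet) equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]

    # Step 4: map each frame mean to the letter of the region it falls in.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


if __name__ == "__main__":
    # Made-up values standing in for a short, steadily rising signal.
    print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4))  # abcd
```

Once series are reduced to short words like this, repeated words can be treated as candidate motifs and rare words as candidate anomalies, which is why the representation suits highly redundant series.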
