SciQL: bridging the gap between science and relational DBMS

Scientific discoveries increasingly rely on the ability to efficiently process massive amounts of experimental data using database technologies. To bridge the gap between the needs of data-intensive research fields and current DBMS technologies, we propose SciQL (pronounced 'cycle'), the first SQL-based query language for scientific applications with both tables and arrays as first-class citizens. It provides a seamless symbiosis of array, set, and sequence interpretations. A key innovation is the extension of the value-based grouping of SQL:2003 with structural grouping, i.e., fixed-size and unbounded groups based on explicit relationships between element positions. This leads to a generalisation of window-based query processing with wide applicability in science domains. This paper describes the main language features of SciQL and illustrates them using time-series concepts.
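
The distinction between value-based and structural grouping can be illustrated with a small sketch. The code below is plain Python, not SciQL syntax, and the function names (`value_groups`, `structural_groups`, `prefix_groups`) are hypothetical helpers introduced only for this illustration. It assumes a one-dimensional time series stored as a list, where an element's position is its list index.

```python
from collections import defaultdict


def value_groups(values):
    """Value-based grouping (GROUP BY style): positions that share a value."""
    groups = defaultdict(list)
    for pos, v in enumerate(values):
        groups[v].append(pos)
    return dict(groups)


def structural_groups(values, offsets):
    """Fixed-size structural grouping.

    For every anchor position i, collect the elements at positions i + d
    for each relative offset d, skipping positions outside the array.
    """
    n = len(values)
    return [
        [values[i + d] for d in offsets if 0 <= i + d < n]
        for i in range(n)
    ]


def prefix_groups(values):
    """Unbounded structural grouping: everything up to and including the anchor."""
    return [values[: i + 1] for i in range(len(values))]


series = [3, 5, 3, 8, 5]

# Groups are formed by equal values, regardless of where they occur.
by_value = value_groups(series)

# Groups are formed by position: each element with its direct neighbours,
# which gives a centred moving average over a window of up to three elements.
moving_avg = [sum(g) / len(g) for g in structural_groups(series, (-1, 0, 1))]

# An unbounded group per anchor gives a running sum.
running_sum = [sum(g) for g in prefix_groups(series)]
```

In this sketch, grouping by value ignores element order entirely, while the two structural variants depend only on relative positions, which is the property that lets them express window-style computations over time series.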
