Efficient Algorithms for Similarity and Skyline Summary on Multidimensional Datasets

Efficient management of large multidimensional datasets has attracted much attention in the database research community. Such datasets are common, and efficient algorithms are needed to analyze them for a variety of applications. In this thesis, we focus our study on two very common classes of analysis: similarity and skyline summarization. We first focus on similarity when one of the dimensions in the multidimensional dataset is temporal. We then develop algorithms for evaluating skyline summaries efficiently for both temporal and low-cardinality attribute domain datasets, and propose different methods for improving the effectiveness of the skyline summary operation. This thesis begins by studying similarity measures for time-series datasets and efficient algorithms for time-series similarity evaluation. The first contribution of this thesis is a new algorithm, called the Fast Time Series Evaluation (FTSE) method, which can be used to evaluate similarity methods whose matching criterion is bounded by a specified ε threshold value. We then show that FTSE can be used in a framework, which we call the Sequence Weighted Alignment (Swale) method, that can evaluate a rich range of ε threshold-based scoring techniques. The second contribution of this thesis is the development of a new time-interval skyline operator, which continuously computes the current skyline over a data stream. We present a new algorithm called Lookout for evaluating such queries efficiently, and empirically demonstrate the scalability of this algorithm. In addition, we examine the effect of the underlying spatial index structure when evaluating skylines. Whereas previous work on skyline computations has only considered the R*-tree index structure, we show that using an underlying quadtree for skyline computations has significant performance benefits over an R*-tree index.
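To make the ε threshold matching criterion mentioned above concrete, the following is a minimal sketch of a threshold-based similarity score in the style of the longest common subsequence. It is the textbook quadratic dynamic program, not the FTSE algorithm, and the function name `lcss_score`, the one-dimensional input, and the absolute-difference matching rule are assumptions made for illustration.

```python
def lcss_score(r, s, eps):
    """Length of the longest common subsequence of r and s, where two
    elements match when their absolute difference is at most eps.

    Illustrative baseline only (hypothetical helper, 1-D sequences).
    """
    n = len(s)
    prev = [0] * (n + 1)  # scores for the previous row of the DP table
    for x in r:
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            if abs(x - s[j - 1]) <= eps:
                # Elements match within the threshold: extend the alignment.
                curr[j] = prev[j - 1] + 1
            else:
                # No match: carry over the best score seen so far.
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(lcss_score([1.0, 2.0, 3.0], [1.1, 2.9, 5.0], eps=0.2))
```

This baseline visits every pair of elements, so its cost grows with the product of the two sequence lengths, which is the kind of expense a faster evaluation method aims to reduce.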
Current skyline evaluation techniques follow a common paradigm that eliminates data elements from skyline consideration by finding other elements in the dataset that dominate them. The performance of such techniques is heavily influenced by the underlying data distribution. The third contribution of this thesis is a novel technique called the Lattice Skyline Algorithm (LS) that is built around a new paradigm for skyline evaluation on datasets with attributes that are drawn from low-cardinality domains. LS continues to apply even if one attribute has high cardinality. The utility of the skyline as a data summarization technique is often diminished by the sheer volume of points in the skyline. The final contribution of this thesis is a novel scheme called the Skyline Point Ordering (SPO), which remedies the skyline volume problem by ranking the elements of the skyline based on their importance to the skyline summary. This ranking allows the most important skyline points to appear first in the skyline result set and provides monotonic top-k skyline queries that simplify the skyline results. We describe two new algorithms, Skyline First (SF) and Coverage First (CF), for ranking the skyline points in a dataset by their summarization importance. Collectively, the techniques described in this thesis present efficient methods for two common and computationally intensive analysis operations on large multidimensional datasets.
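The dominance-based paradigm described above can be sketched as follows. This is a naive quadratic baseline for illustration only, not LS, Lookout, or any other algorithm from this thesis. The function names `dominates` and `skyline` are hypothetical, and the sketch assumes that smaller values are preferred in every dimension.

```python
def dominates(p, q):
    """True if p dominates q: p is no worse than q in every dimension and
    strictly better in at least one (assumes smaller values are better)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def skyline(points):
    """Return the points that no other point in the dataset dominates.

    Naive baseline: each point is compared against every other point.
    """
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    data = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
    print(skyline(data))
```

In this example, (3, 3) and (5, 5) are eliminated because (2, 2) dominates both, while the remaining three points are mutually incomparable and form the skyline.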
