A Probabilistic Analysis of Trie-Based Sorting of Large Collections of Line Segments in Spatial Databases

The storage size of five trie-based methods for sorting large collections of line segments in a spatial database is investigated analytically using a random lines image model and geometric probability techniques. The methods sort the line segments with respect to the space that they occupy. Since the space is two-dimensional, the trie is formed by interleaving the bits of the binary representations of the x and y coordinates of the underlying space and then testing two bits at each iteration. This formulation yields a class of representations referred to as quadtrie variants, although they have traditionally been called quadtree variants. The analysis differs from prior work in that it uses a detailed explicit model of the image instead of modeling the branching process represented by the tree and leaving the underlying image unspecified. The analysis provides analytic expressions and bounds for the expected size of these quadtree variants. This enables prediction of the storage required by the representations and of the associated performance of algorithms that rely on them. The results are useful in two ways. First, they reveal the properties of the various representations and permit their comparison using analytic, nonexperimental criteria. Some of the results confirm previous analyses (e.g., that the storage requirement of the MX quadtree is proportional to the total length of the line segments). An important new result is that for a PMR and Bucket PMR quadtree with sufficiently high values of the splitting threshold (i.e., $\geq 4$) the number of nodes is proportional to the number of line segments and is independent of the maximum depth of the tree. This provides a theoretical justification for the good behavior and use of the PMR quadtree, for which the support so far has been only empirical.
Second, the random lines model was found to be general enough to approximate real data, in the sense that the properties of the trie representations when used to store real data (e.g., maps) are similar to their properties when storing random lines data. Therefore, by specifying an equivalent random lines model for a real map, the proposed analytical expressions can be applied to predict the storage required for real data. Specifying the equivalent random lines model requires only an estimate of the effective number of random lines in it. Several such estimates are derived for real images, and the accuracy of the implied predictions is demonstrated on a real collection of maps. The agreement between the predictions and real data suggests that they could serve as the basis of a cost model that a query optimizer can use to generate an appropriate query evaluation plan.
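The bit-interleaving step described in the abstract can be illustrated with a minimal Python sketch. The function names `interleave_bits` and `quadrant_path`, the parameter `depth`, and the choice to place each y bit above the corresponding x bit are assumptions made for illustration; the abstract does not fix a bit order or an implementation.

```python
def interleave_bits(x: int, y: int, depth: int) -> int:
    """Interleave the low `depth` bits of x and y into one key.

    Assumed convention (not stated in the abstract): within each
    two-bit pair, the y bit is the high bit and the x bit is the low bit.
    """
    code = 0
    for i in range(depth):
        code |= ((x >> i) & 1) << (2 * i)        # x bit goes to the even position
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bit goes to the odd position
    return code


def quadrant_path(x: int, y: int, depth: int) -> list[int]:
    """Return the quadrant index (0..3) chosen at each trie level.

    Two bits of the interleaved key are tested per level, starting
    from the most significant pair (the root of the quadtrie).
    """
    code = interleave_bits(x, y, depth)
    return [(code >> (2 * level)) & 0b11 for level in range(depth - 1, -1, -1)]


if __name__ == "__main__":
    # Cell (x=3, y=5) in an 8 x 8 grid (depth 3).
    print(quadrant_path(3, 5, 3))
```

Each entry of the returned path selects one of four children, which is why testing two interleaved bits per iteration produces a four-way (quad) trie over the two-dimensional space.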
