Decision tree classification of spatial data streams using Peano Count Trees

Many organizations hold large quantities of spatial data collected in application areas such as remote sensing, geographical information systems (GIS), astronomy, computer cartography, and environmental assessment and planning. These collections grow rapidly and can therefore be treated as spatial data streams. Time is a major issue in data stream classification, yet these spatial data sets are too large to be classified in a reasonable amount of time using existing methods. In this paper, we develop a new method for decision tree classification on spatial data streams using a data structure called the Peano Count Tree (P-tree). The P-tree is a spatial data organization that provides a lossless compressed representation of a spatial data set and facilitates efficient classification and other data mining techniques. With the P-tree structure, measures such as information gain can be calculated quickly. We compare P-tree based decision tree induction with a classical decision tree induction method with respect to the speed at which the classifier can be built (and rebuilt when substantial amounts of new data arrive). Experimental results show that the P-tree method is significantly faster than existing classification methods, making it the preferred method for mining spatial data streams.
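To make the idea of a count-based, lossless representation concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes the usual quadrant-count formulation, which the abstract does not spell out: a square bit matrix whose side is a power of two is split recursively into four quadrants in Peano (Z) order, each node stores the number of 1-bits in its quadrant, and quadrants that are all 0 or all 1 are not subdivided further. The names `build_ptree`, `expand`, `entropy`, and `info_gain` are hypothetical, and the information gain function simply takes class counts as input rather than deriving them from P-trees.

```python
from math import log2

def build_ptree(bits, r=0, c=0, size=None):
    """Return (count, children) for the size x size quadrant at (r, c).

    Assumption: bits is a square list of lists of 0/1 with power-of-two side.
    children is None for pure quadrants (all 0 or all 1).
    """
    if size is None:
        size = len(bits)
    if size == 1:
        return (bits[r][c], None)
    half = size // 2
    # Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    kids = [build_ptree(bits, r + dr, c + dc, half)
            for dr in (0, half) for dc in (0, half)]
    count = sum(k[0] for k in kids)
    if count == 0 or count == size * size:
        return (count, None)  # pure quadrant: the count alone describes it
    return (count, kids)

def expand(node, size):
    """Rebuild the bit matrix from the tree (shows the encoding is lossless)."""
    count, kids = node
    if kids is None:
        value = 1 if count else 0
        return [[value] * size for _ in range(size)]
    half = size // 2
    ul, ur, ll, lr = (expand(k, half) for k in kids)
    top = [a + b for a, b in zip(ul, ur)]
    bottom = [a + b for a, b in zip(ll, lr)]
    return top + bottom

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(n / total * log2(n / total) for n in counts if n)

def info_gain(class_counts, split_counts):
    """Information gain of a split, computed purely from counts.

    class_counts: number of samples per class before the split.
    split_counts: one list of per-class counts for each attribute value.
    """
    total = sum(class_counts)
    remainder = sum(sum(part) / total * entropy(part) for part in split_counts)
    return entropy(class_counts) - remainder

bits = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
tree = build_ptree(bits)
restored = expand(tree, len(bits))
gain = info_gain([2, 2], [[2, 0], [0, 2]])
```

In this example the root count is 6, three of the four top-level quadrants are pure and stored as single counts, and `expand` returns the original matrix. Because `info_gain` needs only counts, a structure that yields counts directly can supply its inputs without rescanning the raw data, which is the kind of speedup the abstract refers to.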
