Categorical Data Skyline Using Classification Tree

The skyline query is an effective method for processing large multidimensional data sets: it pinpoints the target data so that dominated data objects (say, 95% of the data) can be efficiently excluded as unnecessary. However, most conventional skyline algorithms were developed to handle numerical data, so most text data could not be processed by them. In this paper, we open a new domain for the skyline query, namely categorical data, and develop the corresponding ranking measures for skyline queries over it. We tested our proposed algorithm using the ACM Computing Classification System.
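To make the notion of dominance concrete, the following is a minimal sketch of a conventional skyline over numerical data, assuming that smaller values are preferred on every dimension. It is not the categorical method proposed in this paper; the function names and the hotel data (price, distance) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every dimension
    and strictly better on at least one (smaller is better)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data: (price, distance to the beach) for five hotels.
hotels = [(50.0, 8.0), (60.0, 2.0), (80.0, 1.0), (90.0, 5.0), (55.0, 9.0)]
print(skyline(hotels))
```

Here (90.0, 5.0) is dominated by (60.0, 2.0) and (55.0, 9.0) is dominated by (50.0, 8.0), so both are excluded. A comparison of this kind presupposes an order on each dimension, which is what categorical values lack by default.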
