On Efficient Handling of Continuous Attributes in Large Data Bases

Some data mining techniques, like discretization of continuous attributes or decision tree induction, are based on searching for an optimal partition of data with respect to some optimization criteria. We investigate the problem of searching for an optimal binary partition of a continuous attribute domain in the case of large data sets stored in relational databases (RDB). Critical for the time complexity of algorithms solving this problem is the number of I/O database operations needed to construct such partitions. In our approach the basic operators are defined by queries on the number of objects characterized by real-value intervals of continuous attributes. We assume that the answer time for such queries does not depend on the interval length. The straightforward approach to optimal partition selection (with respect to a given measure) requires O(N) basic queries, where N is the number of preassumed partition parts in the search space. We show properties of the basic optimization measures that make it possible to reduce the size of the search space. Moreover, we prove that using only O(log N) simple queries, one can construct a partition very close to the optimal one.
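
To make the role of the basic counting queries concrete, the following minimal Python sketch (not the paper's exact algorithm) illustrates the idea: a small in-memory table stands in for the relational database, class_counts simulates the interval count query (a single GROUP BY COUNT in the RDB setting), the discernibility measure serves as an example optimization measure, and a simple halving heuristic locates a promising cut with a logarithmic number of such queries. The data, the function names, and the halving heuristic are illustrative assumptions; the paper itself establishes properties of the measures that justify restricting the search in this way.

```python
from bisect import bisect_left
from collections import Counter

# Toy "database": (value, decision class) pairs for one continuous attribute,
# kept sorted by value so that interval count queries are cheap to simulate.
DATA = sorted([(1.2, 'A'), (1.9, 'A'), (2.4, 'B'), (3.1, 'A'),
               (3.7, 'B'), (4.0, 'B'), (4.6, 'A'), (5.3, 'B'),
               (6.1, 'B'), (6.8, 'A')])
VALUES = [v for v, _ in DATA]

def class_counts(lo, hi):
    """Simulated basic query: per-class object counts with value in [lo, hi)."""
    i, j = bisect_left(VALUES, lo), bisect_left(VALUES, hi)
    return Counter(c for _, c in DATA[i:j])

def discernibility(cut):
    """Example measure: number of object pairs from different decision
    classes separated by the cut (two interval count queries)."""
    left = class_counts(float('-inf'), cut)
    right = class_counts(cut, float('inf'))
    return sum(left[a] * right[b] for a in left for b in right if a != b)

def near_optimal_cut(lo, hi, eps=0.5):
    """Heuristic divide-and-conquer sketch: halve the current interval,
    keep the half whose midpoint scores higher, stop when the interval is
    short.  Uses O(log N) interval queries instead of evaluating every
    candidate cut; guarantees depend on properties of the measure."""
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        left_mid, right_mid = (lo + mid) / 2.0, (mid + hi) / 2.0
        if discernibility(left_mid) >= discernibility(right_mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

print(near_optimal_cut(min(VALUES), max(VALUES) + 1e-9))
```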
