An Interval Classifier for Database Mining Applications

We are given a large population database that contains information about population instances. The population is known to comprise m groups, but the population instances are not labeled with the group identification. Also given is a population sample (much smaller than the population but representative of it) in which the group labels of the instances are known. We present an interval classifier (IC) that generates a classification function for each group; this function can be used to efficiently retrieve all instances of the specified group from the population database. To allow IC to be embedded in interactive loops that answer ad hoc queries about attributes with missing values, IC has been designed to generate classification functions efficiently. Preliminary experimental results indicate that IC not only has advantages in retrieval and classifier generation efficiency, but also compares favorably in classification accuracy with current tree classifiers, such as ID3, which were primarily designed to minimize classification errors. We also describe some new applications that arise from encapsulating the classification capability in database systems, and discuss extensions to IC that would allow it to be used in these new application domains.

Current address: Computer Science Department, Rutgers University, NJ 08903

Proceedings of the 18th VLDB Conference, Vancouver, British Columbia, Canada, 1992
