Data reduction makes datasets smaller while preserving the classification structures of interest. In this paper we present a novel approach to data reduction based on lattices and hyper relations. Hyper relations generalize conventional database relations by allowing sets of values as tuple entries. The advantage is that both raw data and reduced data can be represented by hyper relations. The collection of hyper relations can be naturally made into a complete Boolean algebra, so any collection of hyper tuples has a unique least upper bound (lub), which is a candidate reduction of it. We show that the lub may not qualify as a reduced version of the given set of tuples, but that the interior cover (the subset of internal elements covered by the lub) does qualify. We establish the theoretical result that such an interior cover exists and give a method for finding it. The proposed method was evaluated on 7 real-world datasets. The results were remarkable compared with those obtained by C4.5, and the datasets were reduced with reduction ratios of up to 99%.
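To make the notion of a hyper tuple and its least upper bound concrete, the following is a minimal sketch, not the paper's implementation. It assumes that hyper tuples are ordered by componentwise set inclusion, so that the join of two hyper tuples is their componentwise union. The helper names (`lift`, `join`, `leq`, `lub`) and the toy rows are hypothetical and chosen only for illustration.

```python
from functools import reduce


def lift(row):
    """Embed an ordinary tuple as a hyper tuple of singleton sets."""
    return tuple(frozenset([value]) for value in row)


def join(s, t):
    """Componentwise union (assumed join of two hyper tuples)."""
    return tuple(a | b for a, b in zip(s, t))


def leq(s, t):
    """Componentwise inclusion (assumed ordering on hyper tuples)."""
    return all(a <= b for a, b in zip(s, t))


def lub(hyper_tuples):
    """Least upper bound of a non-empty collection of hyper tuples."""
    return reduce(join, hyper_tuples)


# Toy raw data: each row is a conventional tuple (illustrative values only).
raw = [("red", "small"), ("red", "large"), ("blue", "small")]
hyper = [lift(row) for row in raw]
upper = lub(hyper)
```

Here `upper` is the single hyper tuple `({red, blue}, {small, large})`, which lies above every lifted raw row. Under this ordering it also lies above the lifted row `("blue", "large")`, a combination absent from the raw data, so the join is coarser than the rows it summarizes.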