Rough set analysis is a non-numerical method of data analysis. Among other purposes, it can be used for rule extraction, for classification, and to recognize dependencies among attributes. This paper describes a software system called GROBIAN (an acronym for the German "Grobmengen-Informations-Analysator"; an adequate English translation is ROUGHIAN, Rough set Information Analyzer) which performs rough set data analysis. In addition to the "traditional" procedures of rough set analysis such as reduct analysis and rule generation, we have implemented statistical methods to evaluate the prediction quality of such an analysis, and procedures to investigate changing the granularity within single attributes.

1 A brief introduction to the rough set model

The rough set model for data analysis has been developed by Z. Pawlak and his co-workers since the early 1980s (5). The original idea behind the model is the assumption that, in many situations, objects within a given population can only be distinguished up to a set of features. This may be due to subjective causes such as vagueness, or to objective ones such as measuring errors or insufficient knowledge. In these situations, sets can only be roughly described via upper and lower approximations which are induced by the classes of an equivalence relation on the population. The main thrust of applications of rough set analysis, however, is knowledge discovery in databases. Many examples of real-life applications of rough set analysis can be found in (9).

Knowledge representation is done with information systems, which are a tabular form of an OBJECT → ATTRIBUTE VALUE relationship. If knowledge of objects is represented by attributes and their values, it is important to find the relationships among the attributes. Knowing dependencies simplifies the original information system, reduces computational overhead and, at times, may indicate causal relationships. If P is a set of attributes and d is an attribute, we write P → d to express the dependency (also called a rule): whenever two objects agree on all attributes in P, then they agree on d. A reduct with respect to a (dependent) attribute d is a set Q of attributes which is minimal with respect to the property that Q → d. The intersection of all reducts is called the core. Determination of reducts and the core is at the heart of rough set data analysis. Attributes within the core are necessary for the representation of the decision attribute, whereas an empty core signals a high substitution rate among the attributes.

Since it is based on equivalence relations, rough set analysis needs only internal information and does not rely on additional model assumptions as fuzzy set methods or probabilistic models do. In other words, instead of using external numbers or other additional parameters, rough set analysis utilizes only the structure of the given data and its inherent metrics. A widely used metric is the approximation quality of a partition with respect to another one; this is, roughly speaking, the ratio of the number of all correctly classified elements to the total number of objects. The approximation quality of an attribute is measured by the drop of the approximation quality when the attribute is removed from the set of (independent) attributes under consideration. A more detailed overview of the rough set method can be found in (1), and a comprehensive presentation is the monograph (6).
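To make the approximation quality concrete, the following minimal sketch (our own illustration, not GROBIAN or RSL code; the table layout and function names are hypothetical) computes it as the fraction of objects whose P-equivalence class is contained in a single class of the decision attribute d:

    #include <iostream>
    #include <map>
    #include <set>
    #include <vector>

    using Row = std::vector<int>;  // one object, attribute values coded as integers

    // Approximation quality of P -> d: the fraction of objects whose
    // P-equivalence class is contained in a single d-class.
    double approximationQuality(const std::vector<Row>& table,
                                const std::vector<int>& P, int d) {
        std::map<std::vector<int>, std::set<int>> dValues;  // P-class -> d-values seen
        std::map<std::vector<int>, int> classSize;          // P-class -> number of objects
        for (const Row& obj : table) {
            std::vector<int> key;
            for (int a : P) key.push_back(obj[a]);
            dValues[key].insert(obj[d]);
            ++classSize[key];
        }
        int pos = 0;  // size of the positive region
        for (const auto& kv : dValues)
            if (kv.second.size() == 1)  // class predicts d deterministically
                pos += classSize[kv.first];
        return static_cast<double>(pos) / table.size();
    }

    int main() {
        // four objects; attributes 0 and 1 are independent, attribute 2 is d
        std::vector<Row> table = {{1,1,1}, {1,1,1}, {1,2,2}, {2,1,2}};
        std::cout << approximationQuality(table, {0,1}, 2) << "\n";  // 1.0
        std::cout << approximationQuality(table, {0},   2) << "\n";  // 0.25
    }

A value of 1 means that the attribute set determines d perfectly; lower values quantify the error margin discussed in Section 3.2 below.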
We have used GROBIAN to (re-)analyze the results of two applications of rough set analysis: the duodenal ulcer data of (7), and the earthquake data of (11), whose authors search for premonitory factors of earthquakes by emphasizing gas geochemistry.

2 GROBIAN: A short program description

GROBIAN uses a few functions of the RSL library (RSL; 4), a collection of C-coded routines which cover a broad range of problems in rough set analysis. We have enhanced the library by several new functions, and have ported it to C++ classes, which has resulted in a more flexible, more stable and more transparent coding of the functions. Whereas in the standard C code one needs to inspect the functions (e.g. permute) to know which information system is really in use, the C++ classes immediately show what is going on. Another advantage of using C++ classes is the ease of implementing checks of whether a function is applicable at all. The design of GROBIAN allows any procedure implemented in the RSL to be transformed easily into a WIN3.x/WIN95 user interface. In addition, we have implemented new theoretical results such as granularity analysis (2) and statistical validation of dependencies (3), both of which are described below.

After startup, GROBIAN needs a file containing the Rough Information System (RIS) structure. The RIS file is in ASCII format and contains all information needed to perform the data analysis. There are several ways to produce such a file via the file menu.

3 Major tasks of rough set analysis

3.1 Coding and filtering

Any data analysis starts with so-called "raw data": unfiltered measurements of attributes within the domain under investigation. Rough set analysis, like other types of analysis, needs a preprocessing step whose result is "data" suitable for further analysis. GROBIAN provides two basic preprocessing steps:

• Data conversion, to convert attribute intervals of groups of attributes into coded equivalence classes, and
• Data ranges, to define suitable ranges of (groups of) attributes for further analysis.

The result of both procedures can be stored temporarily, to perform an experimental filtering, or permanently, to define a new information system in which the filtered data is considered to be the new raw data. The "View data" option of GROBIAN presents the raw data as well as their filtered counterpart, which enables the researcher to control the results of the filter procedures. Because equivalence relations are the only data type allowed in rough set analysis, processing starts with the construction of the equivalence classes induced by an attribute, obtained by identifying objects which have the same value with respect to this attribute.

3.2 Finding reducts and core

GROBIAN's search menu provides the items "Reducts" and "Core". As an example, we demonstrate how GROBIAN handles the analysis of the HSV data (8). There is only one dependent variable to define the classification, namely the "Visick_coding" attribute, which signifies the healing success; all other attributes are viewed as independent variables. GROBIAN searches for all reducts which describe the data up to a predefined "minimal approximation quality". If the minimal approximation quality equals 1, the prediction of the dependent variable(s) must be perfect, while lower values indicate that we are willing to allow an error margin in the prediction. GROBIAN enables the researcher to perform reduct and core analysis within the information system using a systematic decrease of the minimal approximation quality, along the lines of the sketch below.
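The following brute-force sketch (illustrative only, exponential in the number of attributes, and not GROBIAN's actual search procedure) shows what such a search computes: reducts at a given minimal approximation quality are the inclusion-minimal attribute sets reaching that quality, and the core is their intersection. It reuses Row and approximationQuality() from the sketch in Section 1 and represents attribute sets as bitmasks:

    #include <vector>

    std::vector<unsigned> findReducts(const std::vector<Row>& table,
                                      int nAttrs, int d, double minQuality) {
        std::vector<unsigned> reducts;  // each reduct stored as a bitmask
        for (unsigned mask = 1; mask < (1u << nAttrs); ++mask) {
            if (mask & (1u << d)) continue;  // exclude the decision attribute
            std::vector<int> Q;
            for (int a = 0; a < nAttrs; ++a)
                if (mask & (1u << a)) Q.push_back(a);
            if (approximationQuality(table, Q, d) < minQuality) continue;
            // keep only inclusion-minimal qualifying sets: every proper subset
            // of mask has a smaller numeric value and was examined earlier
            bool minimal = true;
            for (unsigned r : reducts)
                if ((r & mask) == r) { minimal = false; break; }
            if (minimal) reducts.push_back(mask);
        }
        return reducts;
    }

    // The core is the intersection of all reducts (empty if there are none).
    unsigned core(const std::vector<unsigned>& reducts) {
        unsigned c = reducts.empty() ? 0u : ~0u;
        for (unsigned r : reducts) c &= r;
        return c;
    }

Lowering minQuality step by step corresponds to the systematic decrease of the minimal approximation quality described above.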
Because the core is defined as the intersection of all reducts of the information system, a reduct analysis can always determine the core as well. The analysis of the HSV data shows that there is no perfect prediction of "Visick_coding"; in other words, C(1.00) = ∅. If we use a minimal approximation quality of 0.91, the corresponding core is empty as well: C(0.91) = ∅. As a starting point for further analysis, we analyze the core C(0.94) = {3, 4, 5, 6, 8, 10} and, because a core need not be a good candidate for a reduct with high approximation quality, reducts "close to" this core, such as the reduct {2, 3, 4, 5, 8, 9} suggested by (7). Since the first step of rough set analysis is always the determination of the core, and since the core can be computed more efficiently than by intersecting all reducts, GROBIAN offers an extra entry for the core computation.

The core C(1.00) of the earthquake information system is empty, which indicates that the data filtering process is not complete. The next section shows how one can proceed in such a situation.

3.3 Analysis of the empty core situation

An analysis of the earthquake data with the rough set approach discovers that there is an empty core for the dependent attribute "Seismic activity". One possibility to solve this problem is to eliminate some of the attribute values and check whether the new system has a core. Because rough set analysis gives no hint as to which values should be eliminated, we have suggested a strategy by which values of the (non-decision) attributes can be identified with each other without losing information with respect to the decision attributes (2). The result of this analysis is a data-driven filter procedure which produces additional coding rules within the attributes. We call the result of this procedure a rough filter. The GROBIAN analysis discovers that the granularity of some of the attributes is too high, and that some variables can be recoded without loss of information. GROBIAN proposes the following transformations:

• Radon11 should be filtered by {1, 2} → {1} and {3, 4, 5} → {2},
• Radon21 should be handled by the same filter procedure,
• Radon32 should be filtered by {1, 5} → {1} and {2, 3, 4} → {2},
• Radon62 should be filtered by {1, 4} → {1} and {2, 3, 5} → {2},
• Atmospheric pressure should be filtered by {1, 2, 3} → {1} and {4, 5} → {2}.

Although rough filtering aims to lower the exchangeability of rules, there is no guarantee that the core of the filtered information system is not empty as well (as is the case in our example). Nevertheless, the researcher gains further insight into the structure of the data and may obtain ideas on how to perform the coding and filtering procedures, which may result in a more suitable information system.

3.4 Statistical analysis of a reduct

A high approximation quality is not a guarantee that the result of a rough set analysis is valid, and attention needs to be paid to the underlying statistical assumptions of the model. If, for
References

[1] Ivo Düntsch et al. Statistical evaluation of rough set dependency analysis. Int. J. Hum. Comput. Stud., 1997.
[2] Roman Slowinski et al. Sensitivity Analysis of Rough Classification. Int. J. Man Mach. Stud., 1990.
[3] Jacques Teghem et al. Use of "Rough Sets" Method to Draw Premonitory Factors for Earthquakes by Emphasing Gas Geochemistry: The Case of a Low Seismic Activity Context, in Belgium. In Intelligent Decision Support, 1992.
[4] Ivo Düntsch et al. Simple data filtering in rough set systems. Int. J. Approx. Reason., 1998.
[5] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. 1991.
[6] R. Słowiński. Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory. 1992.
[7] Roman Slowinski et al. 'Roughdas' and 'Roughclass' Software Implementations of the Rough Sets Approach. In Intelligent Decision Support, 1992.
[8] Janusz Zalewski et al. Rough sets: Theoretical aspects of reasoning about data. 1996.