Propositional Satisfiability Method in Rough Classification Modeling for Data Mining

A fundamental question in data mining is whether all of the available information is needed to represent an information system (IS). The goal of data mining is to find rules that model the world sufficiently well. Each rule consists of a description, that is, a set of conditions over attribute-value pairs, and a classification of the decision attribute. However, the set of all decision rules generated from all conditional attributes can be too large and can contain many chaotic rules that are unsuitable for classifying unseen objects. A search for the best rules must therefore be performed, because it is not possible to determine the quality of every rule generated from the information system. In the rough set approach to data mining, the set of interesting rules is determined using the notion of a reduct. Rules are generated from a reduct by binding each attribute in the reduct to the condition attribute values of the object class from which the reduct originates. It is important for reducts to be minimal in size, since minimal reducts reduce the number of conditional attributes used to generate rules. Shorter rules are expected to classify new cases more accurately because they have larger support in the data, and in some sense the most stable and frequently appearing reducts give the best decision rules. The main work of this thesis is the generation of a classification model that contains fewer rules, has shorter rules, and achieves good accuracy. A propositional satisfiability method for rough classification modeling is proposed. Two models, Standard Integer Programming (SIP) and Decision Related Integer Programming (DRIP), are proposed to represent the minimal reduct computation problem. Both models formalize the discernibility relation of a decision system (DS) as an Integer Programming (IP) model.
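To make the idea of casting minimal reduct computation as an integer program concrete, the following is a generic set-covering formulation over the discernibility matrix. It is a sketch only: the abstract does not state the exact SIP or DRIP formulations, and the symbols used here ($A$ for the set of conditional attributes, $c_{ij}$ for the set of attributes on which objects $u_i$ and $u_j$ differ, $x_a$ for the binary choice of attribute $a$) are assumptions introduced for illustration.

```latex
\begin{aligned}
\min \quad & \sum_{a \in A} x_a \\
\text{subject to} \quad & \sum_{a \in c_{ij}} x_a \ge 1
    && \text{for every pair } (u_i, u_j) \text{ with } c_{ij} \neq \emptyset, \\
& x_a \in \{0, 1\} && \text{for every } a \in A.
\end{aligned}
```

Each constraint requires that at least one attribute distinguishing a pair of objects is kept, so any feasible solution preserves discernibility and an optimal one is a reduct of minimum size. Restricting the constraints to pairs whose decision values differ gives a decision-relative variant; whether that restriction is what separates DRIP from SIP is an assumption and is not confirmed by the abstract.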
The proposed models were embedded within the default rules generation framework to obtain a new rough classification method. An improved branch and bound strategy that prunes part of the search is proposed to solve the SIP and DRIP models. The strategy uses a conflict analysis procedure to remove unnecessary attribute assignments and to determine the branch level to which the search backtracks in a non-chronological manner. Experiments were conducted on five data sets from the UCI machine learning repository and domain theories. The total number of rules generated for the best classification model was recorded, with 30% of the data used for training and 70% kept as test data. The classification accuracy, the number of rules, and the maximum rule length obtained with the SIP/DRIP method were compared with those of other rough set methods: Genetic Algorithm (GA), Johnson, Holte's 1R, Dynamic, and Exhaustive. Four of the data sets were then chosen for further experiments. The improved search strategy implements a non-chronological backtracking search that can potentially prune a large portion of the search space. The experimental results showed that the proposed SIP/DRIP method is successful in rough classification modeling. Its outstanding feature is the reduced number of rules in all classification models. SIP/DRIP also generated shorter rules than the other methods on most data sets. Results for the proposed search strategy indicated that the best performance can be achieved at lower levels, that is, on shorter paths, of the search tree. The SIP/DRIP method also showed promising results when compared with other commonly used classifiers such as neural networks and statistical methods. The model is expected to represent the knowledge of the system efficiently.
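As an illustration of how a minimum-size reduct can be found by branch and bound, the sketch below builds decision-relative discernibility clauses from a small, made-up decision table and searches for the smallest attribute set that covers every clause. It is not the thesis's algorithm: it uses plain chronological backtracking with a simple size bound rather than the conflict analysis and non-chronological backtracking described above, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def discernibility_clauses(rows, decisions):
    """One clause per pair of objects with different decisions:
    the set of attribute indices on which the two objects differ."""
    clauses = set()
    for (r1, d1), (r2, d2) in combinations(zip(rows, decisions), 2):
        if d1 != d2:
            diff = frozenset(i for i, (a, b) in enumerate(zip(r1, r2)) if a != b)
            if diff:  # identical objects with different decisions add no constraint
                clauses.add(diff)
    return list(clauses)


def minimal_reduct(clauses, n_attrs):
    """Branch and bound for a smallest attribute set intersecting every clause."""
    best = set(range(n_attrs))  # incumbent: keeping all attributes always works

    def search(chosen, remaining):
        nonlocal best
        if len(chosen) >= len(best):
            return  # bound: this branch cannot beat the incumbent
        if not remaining:
            best = set(chosen)  # every clause is covered by a smaller set
            return
        clause = min(remaining, key=len)  # branch on the tightest clause
        for attr in sorted(clause):
            uncovered = [c for c in remaining if attr not in c]
            search(chosen | {attr}, uncovered)

    search(frozenset(), clauses)
    return best


# Toy decision table (hypothetical): three binary condition attributes.
rows = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 1)]
decisions = ["yes", "no", "no", "yes"]

reduct = minimal_reduct(discernibility_clauses(rows, decisions), n_attrs=3)
print(sorted(reduct))  # [1]
```

In this toy table the second attribute (index 1) alone determines the decision, so the search returns a reduct of size one. Branching on the shortest uncovered clause keeps the tree shallow, which is one simple way to favour short paths in the search.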