Socio-economic characterization of student data using ICA and cluster analysis

Data mining is the automated process of discovering knowledge in databases, and different data mining methods search for different kinds of knowledge. Data mining systems induce knowledge from data sets that are huge, noisy (incorrect), incomplete, inconsistent, imprecise (fuzzy), and uncertain. One problem is that existing systems use a restrictive attribute-value language to represent both the training examples and the induced knowledge. Furthermore, some important patterns are ignored because they are statistically insignificant. Independent component analysis (ICA) can be used as a tool for extracting features from large data sets. Optimizing objective functions such as mutual information, joint entropy, negentropy, and kurtosis leads to iterative algorithms for ICA. In this paper, a new approach is taken to the identification of data attributes and the socio-economic characterization of data, based on ICA and cluster analysis. The approach relies entirely on measured entropies of the system and on the minimization of mutual information. The technique has been applied to efficiently extract the independent components, or data attributes, from a large data set. The sample data set is obtained from scanned optical mark recognition (OMR) application forms of candidates applying for various courses at an Indian university that provides educational services to various sections of society. Using cluster analysis, useful results have been found that can be used for the socio-economic characterization of candidates applying for engineering courses at the university. Simulations with this data are presented to show the effect.
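
To make the entropy-based criterion concrete, the following is a minimal sketch of how Shannon entropy and mutual information can be estimated from discrete form fields. It is not the algorithm used in this paper: the field names (`income`, `region`, `gender`) and the toy records are hypothetical, and the sketch only illustrates the quantity that a mutual-information-minimizing ICA procedure would drive toward zero.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Estimate I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Hypothetical toy records, not taken from the paper's data set.
income = ["low", "low", "high", "high"]
region = ["rural", "rural", "urban", "urban"]  # fully determined by income
gender = ["f", "m", "f", "m"]                  # independent of income

mi_dependent = mutual_information(income, region)    # 1.0 bit
mi_independent = mutual_information(income, gender)  # 0.0 bits
```

In this toy case the redundant pair of attributes shares one full bit of information, while the independent pair shares none. Attributes with near-zero pairwise mutual information are the kind of statistically independent components the approach aims to recover.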