Relevant Attribute Discovery in High Dimensional Data Based on Rough Sets: Applications to Leukemia Gene Expressions

A pipelined approach using two clustering algorithms in combination with Rough Sets is investigated for the purpose of discovering important combinations of attributes in high dimensional data. In many domains, the data objects are described in terms of a large number of features, as in gene expression experiments or in samples characterized by spectral information. The Leader and several k-means algorithms are used as fast procedures for simplifying the attribute set of the information systems presented to the rough sets algorithms. The data submatrices described in terms of these features are then discretized with respect to the decision attribute according to different rough set based schemes. From them, the reducts and their derived rules are extracted and applied to test data in order to evaluate the resulting classification accuracy. This approach was explored on Leukemia gene expression data in a series of experiments conducted within a high-throughput distributed-computing environment. The experiments led to subsets of genes with high discrimination power. Good results were obtained with no preprocessing applied to the data.
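
As an illustration of the attribute simplification step, the following is a minimal sketch of a one-pass Leader clustering applied to attributes (columns) rather than to objects. It is not the authors' implementation: the Euclidean distance, the threshold value, the nearest-leader assignment rule, the toy data, and the names `leader_clusters`, `euclidean`, and `reduced` are all assumptions made for the example. The abstract does not state which dissimilarity measure or threshold was used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def leader_clusters(vectors, threshold, distance=euclidean):
    """One-pass Leader clustering.

    Each vector joins the nearest existing leader whose distance is within
    `threshold`; otherwise it becomes a new leader.
    Returns (leaders, assignment), where `leaders` holds indices into
    `vectors` and `assignment[i]` is the position in `leaders` of the
    cluster that vector i belongs to.
    """
    leaders = []
    assignment = []
    for i, vec in enumerate(vectors):
        best, best_d = None, None
        for k, lead in enumerate(leaders):
            d = distance(vec, vectors[lead])
            if d <= threshold and (best_d is None or d < best_d):
                best, best_d = k, d
        if best is None:
            # No leader is close enough: this vector starts a new cluster.
            leaders.append(i)
            best = len(leaders) - 1
        assignment.append(best)
    return leaders, assignment


# Hypothetical toy matrix: rows are samples, columns are genes.
samples = [
    [1.0, 1.1, 5.0, 9.0],
    [2.0, 2.1, 4.0, 9.2],
    [3.0, 2.9, 6.0, 8.9],
]

# Cluster the attributes, so each gene (column) is treated as a vector.
genes = [list(col) for col in zip(*samples)]
leaders, assignment = leader_clusters(genes, threshold=0.5)

# Keep only the leader genes as the simplified attribute set.
reduced = [[row[j] for j in leaders] for row in samples]
```

In this sketch the first two genes are nearly identical across samples, so the second is absorbed by the first and only three leader columns remain in `reduced`. A submatrix of this kind is what would then be discretized and passed to the rough set stage.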
