Relevant Attribute Discovery in High Dimensional Data: Application to Breast Cancer Gene Expressions

In many domains, data objects are described by a large number of features. The pipelined data mining approach introduced in [1], which combines two clustering algorithms with rough sets and is extended with genetic programming, is investigated for the purpose of discovering important subsets of attributes in high dimensional data. The classification ability of these subsets is described in terms of both collections of rules and analytic functions obtained by genetic programming (gene expression programming). The Leader algorithm and several k-means algorithms are used to simplify the attribute sets of the information systems that are later presented to rough sets algorithms. Visual data mining techniques, including virtual reality, were used to inspect the results. The data mining process is set up using high throughput distributed computing techniques. This approach was applied to breast cancer microarray data and led to subsets of genes with high discrimination power with respect to the decision classes.
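As an illustration of the attribute set simplification step, the following is a minimal sketch of a one-pass Leader clustering applied to attributes (genes). It is not the authors' implementation: the Euclidean distance, the threshold value, the toy data, and the names `leader_clustering`, `genes`, and `reduced` are all assumptions made for the example. Each attribute joins the first leader within the threshold; otherwise it becomes a new leader, and the leaders form the reduced attribute set that would be passed on to a rough sets stage.

```python
import math


def leader_clustering(vectors, threshold):
    """One-pass Leader clustering (illustrative sketch).

    Each vector joins the first existing leader whose Euclidean distance
    is at most `threshold`; otherwise the vector becomes a new leader.
    Returns (leaders, assignment), where `leaders` holds indices into
    `vectors` and `assignment[i]` is the cluster number of vector i.
    """
    leaders = []
    assignment = []
    for i, v in enumerate(vectors):
        for k, j in enumerate(leaders):
            if math.dist(v, vectors[j]) <= threshold:
                assignment.append(k)
                break
        else:
            # No leader is close enough: this vector starts a new cluster.
            leaders.append(i)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


# Hypothetical toy data: each row is one attribute (gene) over four samples.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.0],
    [9.0, 8.0, 7.0, 6.0],
    [1.0, 2.0, 3.1, 3.9],
    [9.1, 8.1, 7.0, 6.2],
]

# The threshold is an assumed value chosen for this toy data only.
leaders, assignment = leader_clustering(genes, threshold=0.5)

# Keep only the leader attributes as the simplified attribute set.
reduced = [genes[j] for j in leaders]
```

The result of a Leader pass depends on the order in which attributes are presented and on the threshold, so in practice both would need to be chosen with care; a similarity measure other than Euclidean distance could be substituted without changing the structure of the loop.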

[1] John W. Sammon, et al. A Nonlinear Mapping for Data Structure Analysis, 1969, IEEE Transactions on Computers.

[2] Aleksander Ohrn, et al. ROSETTA -- A Rough Set Toolkit for Analysis of Data, 1997.

[3] Julio J. Valdés, et al. Relevant Attribute Discovery in High Dimensional Data Based on Rough Sets Applications to Leukemia Gene Expressions, 2005.

[4] John A. Hartigan, et al. Clustering Algorithms, 1975.

[5] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data, 1991.

[6] Candida Ferreira. Gene expression programming, 2006.

[7] Cândida Ferreira, et al. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence (Studies in Computational Intelligence), 2006.

[8] Syed Mohsin, et al. Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer, 2003, The Lancet.

[9] J. G. Carbonell, et al. Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, 2003, Lecture Notes in Computer Science.

[10] Julio J. Valdés, et al. Gene Discovery in Leukemia Revisited: A Computational Intelligence Perspective, 2004, IEA/AIE.

[11] Janusz Zalewski, et al. Rough sets: Theoretical aspects of reasoning about data, 1996.

[12] Cândida Ferreira, et al. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence, 2014, Studies in Computational Intelligence.

[13] J. Gower. A General Coefficient of Similarity and Some of Its Properties, 1971.

[14] Julio J. Valdés, et al. Virtual Reality Representation of Information Systems and Decision Rules: An Exploratory Technique for Understanding Data and Knowledge Structure, 2003, RSFDGrC.

[15] Michael R. Anderberg, et al. Cluster Analysis for Applications, 1973.