Symbolic and Subsymbolic Machine Learning Approaches for Molecular Classification of Cancer and Ranking of Genes

Background: Classification of human tumors into distinguishable entities is preferentially based on clinical, pathohistological, enzyme-based histochemical, immunohistochemical, and in some cases cytogenetic data. This classification system still yields classes whose tumors appear similar but differ strongly in important aspects, e.g. clinical course, treatment response, or survival. Information obtained by newer techniques such as cDNA microarrays, which profile gene expression in tissues, might therefore help to resolve this dilemma. Microarray experiments, however, provide the scientific community with an immense amount of data. Without appropriate analysis tools, significant insights hidden in this pool of data might go unrecognized. Methods capable of handling large data sets with thousands of attributes are therefore needed.

Method: Based on microarray gene expression data, we investigate two popular machine learning techniques in the context of molecular classification of cancer, identification of the most informative genes, and prediction of clinically relevant parameters. The techniques in question are (1) decision trees (symbolic approach) and (2) artificial neural networks (subsymbolic approach). As a basis for our comparative study we have chosen two of the most popular algorithms in machine learning software, namely the decision tree/rule induction algorithm C5.0 and the well-known backpropagation algorithm for multilayer perceptrons (MLP), a specific architecture of artificial neural networks (ANN) [2,3,4]. For both algorithms we used the proprietary implementations provided in the data mining tool Clementine from SPSS [5]. Decision trees are advantageous in situations where the complexity is relatively low (a small number of variables and a low degree of dependency between variables) and the variables are directly interpretable by humans (numeric variables such as age or cholesterol, and symbolic variables such as gender or tumor stage).
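To illustrate how a symbolic learner can both classify samples and rank genes, the following minimal sketch scores each gene by the best entropy-based information gain achievable with a single threshold split. It is not the proprietary C5.0 implementation in Clementine; it only shows the general kind of split criterion used by decision tree inducers. The function names, gene names, and expression values are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_gain(values, labels):
    """Return (gain, threshold) of the best binary split on one numeric attribute."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best_gain, best_threshold = 0.0, None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_threshold = gain, (lo + hi) / 2
    return best_gain, best_threshold

def rank_genes(expression, labels):
    """Rank genes by their best single-split information gain, highest first."""
    scored = [(gene, *best_split_gain(vals, labels)) for gene, vals in expression.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Hypothetical toy data: 3 genes measured on 6 tumor samples from two classes.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene_1": [0.1, 0.3, 0.2, 1.9, 2.1, 1.7],  # separates the classes cleanly
    "gene_2": [1.0, 0.2, 1.5, 0.3, 1.1, 0.9],  # only weakly informative
    "gene_3": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # constant, uninformative
}

ranking = rank_genes(expression, labels)
for gene, gain, threshold in ranking:
    print(gene, round(gain, 3), threshold)
```

A full tree inducer would apply such a criterion recursively to the resulting subsets; the single-split score is used here only because it yields a directly readable gene ranking.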
Artificial neural networks, on the other hand, have been found useful in situations where there are many interacting variables (e.g., images) and the underlying phenomena behave non-linearly. We used all expression data (except the control data) without further processing