Predicting tumour stages of lung cancer adenocarcinoma tumours from pooled microarray data using machine learning methods

This paper presents a novel combination of methods for predicting the stage of lung cancer adenocarcinoma tumours: differential expression analysis (linear modelling) for gene selection, followed by machine learning classification (support vector machines (SVMs) and random forest (RF)), applied to a dataset pooled from multiple publicly available microarray experiments. Raw data from 123 tumour microarray samples were preprocessed using robust multi-array average (RMA) and analysed using linear models for microarray data (LIMMA) to screen for significantly differentially expressed genes; two gene lists were identified under different experimental settings. These two gene lists were then used as input features for the SVM and RF models to build the prediction models. As a result, both the SVM and RF models predicted lung cancer stage with accuracies ranging from 67% to 71%.
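
The gene-screening idea summarised above can be illustrated with a minimal sketch. It is not the paper's pipeline: LIMMA fits linear models with empirical-Bayes moderated t-statistics, whereas this sketch substitutes an ordinary Welch t-statistic as a simplified stand-in, and it runs on synthetic data. The function names (`welch_t`, `rank_genes`), the two-group labelling, the sample and gene counts, and the planted gene indices are all assumptions made for illustration.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two lists of expression values."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(expr, labels, top_k):
    """Return indices of the top_k genes ranked by |t| between two groups.

    expr   -- list of samples, each a list of per-gene expression values
    labels -- list of 0/1 group labels (e.g. two stage groups), one per sample
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        group0 = [row[g] for row, y in zip(expr, labels) if y == 0]
        group1 = [row[g] for row, y in zip(expr, labels) if y == 1]
        scores.append((abs(welch_t(group0, group1)), g))
    scores.sort(reverse=True)  # largest |t| first
    return [g for _, g in scores[:top_k]]


# Synthetic, hypothetical data: 40 samples, 50 genes, two groups.
rng = random.Random(0)
n_samples, n_genes = 40, 50
labels = [i % 2 for i in range(n_samples)]
expr = []
for y in labels:
    row = [rng.gauss(8.0, 1.0) for _ in range(n_genes)]
    for g in (3, 17):  # planted differentially expressed genes (assumption)
        row[g] += 4.0 * y
    expr.append(row)

selected = rank_genes(expr, labels, top_k=2)
print(sorted(selected))
```

In a fuller workflow, the expression values of the selected genes would serve as the feature matrix passed to a classifier such as an SVM or a random forest, with accuracy estimated on held-out samples.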