Utilization of virtual samples to facilitate cancer identification for DNA microarray data in the early stages of an investigation

DNA microarray datasets are generally small in size, high-dimensional with many non-discriminative genes, and non-linear with outliers. Their low ratio of samples to dimensions makes learning from DNA microarray data a small-sample problem. Recently, researchers have developed various gene selection algorithms to discover the genes most relevant to a specific disease and thereby reduce computation. Most gene selection algorithms improve learning performance and efficiency, but they still suffer from insufficient training samples in the datasets. Moreover, in the early stage of diagnosing a new disease, very limited data can be obtained, so the derived diagnostic model is usually unreliable for identifying the new disease. Consequently, diagnostic performance is not always robust, even with gene selection algorithms. To address the problem of a very limited training dataset with non-linear data or outliers, we propose GVSG (Group Virtual Sample Generation), a non-linear virtual sample generation algorithm. This method detects the characteristics of the very limited data, forms discrete groups for each discriminative gene, and systematically generates virtual samples for each of these groups to accelerate and stabilize the modeling process. The results show that this method significantly improves learning accuracy in the early stages of DNA microarray data analysis.
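The abstract does not state how groups are formed or how virtual samples are drawn, so the following Python sketch only illustrates the general idea of group-wise virtual sample generation for a single gene. It is not the GVSG algorithm itself. The grouping rule (cutting the sorted expression values at the largest gaps), the uniform sampling within a slightly widened group range, and the names `split_into_groups`, `generate_virtual_values`, `max_groups`, and `margin` are all assumptions made for illustration.

```python
import random


def split_into_groups(values, max_groups=2):
    """Split 1-D values into groups by cutting at the largest gaps.

    Assumed grouping rule for illustration; the abstract does not specify one.
    """
    ordered = sorted(values)
    if len(ordered) < 2 or max_groups < 2:
        return [ordered]
    # Positions of gaps between consecutive sorted values, largest gap first.
    gap_positions = sorted(
        range(1, len(ordered)),
        key=lambda i: ordered[i] - ordered[i - 1],
        reverse=True,
    )
    cuts = sorted(gap_positions[: max_groups - 1])
    groups, start = [], 0
    for cut in cuts:
        groups.append(ordered[start:cut])
        start = cut
    groups.append(ordered[start:])
    return groups


def generate_virtual_values(values, n_virtual, max_groups=2, margin=0.1, rng=None):
    """Draw virtual expression values for one gene, group by group.

    Each virtual value is sampled uniformly from the range of one group,
    widened by `margin` times the group's width (an assumed sampling rule).
    """
    rng = rng or random.Random(0)
    groups = split_into_groups(values, max_groups)
    weights = [len(group) for group in groups]
    virtual = []
    for _ in range(n_virtual):
        # Pick a group with probability proportional to its size.
        group = rng.choices(groups, weights=weights)[0]
        low, high = group[0], group[-1]
        pad = margin * (high - low)
        virtual.append(rng.uniform(low - pad, high + pad))
    return virtual


if __name__ == "__main__":
    observed = [0.10, 0.20, 0.15, 0.90, 1.00]  # hypothetical expression values
    print(split_into_groups(observed))
    print(generate_virtual_values(observed, n_virtual=5))
```

Sampling within each group separately, rather than across the full range of the gene, keeps virtual values out of the empty region between clusters. That is one plausible reading of why discrete groups would help with non-linear data and outliers.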
