A hybrid algorithm of minimum spanning tree and nearest neighbor for classifying human cancers

Classification and prediction of different cancers based on gene expression profiles are important for cancer diagnosis, cancer treatment and drug discovery. The k nearest neighbor algorithm (k-NN) is a simple and efficient machine learning method for cancer classification, and its performance depends critically on the choice of the parameter k. In this paper, we integrate the minimum spanning tree (MST) with k-NN for cancer classification. The MST is used to select both the parameter k and the nearest neighbors for k-NN. First, we build an MST, based on the Euclidean distance between each pair of samples, over gene expression data that include exactly one sample of unknown class. Second, we find the samples connected to the unknown-class sample in the MST and apply the majority vote principle to them. Third, if the votes are tied, we expand the set of connected samples with the nearest neighbors of the unknown-class sample until either all samples have been included or the class of the unknown sample is determined. This hybrid algorithm is referred to as MSTNN. MSTNN is compared with k-NN and three other existing classification algorithms on three binary-class gene expression datasets (CNS, Colon, and Lung) and three multi-class gene expression datasets (Leukemia1, Leukemia2, and Leukemia3) involving human cancers. MSTNN improves on k-NN by 5.65% in average LOOCV accuracy and by 13.80% in average classification accuracy on the testing datasets, and achieves the best performance among all five algorithms. The results demonstrate that the proposed MSTNN algorithm is feasible for classifying human cancers.
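The three steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`euclidean`, `mst_edges`, `mstnn_predict`) are hypothetical, the MST is built with Prim's algorithm as one possible choice, the tie-breaking step is assumed to add one nearest remaining sample at a time, and returning `None` for an unresolved tie is also an assumption.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two expression profiles (sequences of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns a list of (i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection cost to the tree
    best_from = [-1] * n         # tree vertex providing that connection
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def mstnn_predict(train_x, train_y, query):
    """Sketch of MSTNN for one unknown sample (assumed reading of the abstract)."""
    if not train_x:
        raise ValueError("at least one labelled sample is required")
    # Step 1: MST over the labelled samples plus the single unknown sample.
    points = list(train_x) + [query]
    q = len(points) - 1
    # Step 2: labelled samples directly connected to the unknown sample in the MST.
    voters = [j if i == q else i for i, j in mst_edges(points) if q in (i, j)]
    # Remaining labelled samples, nearest to the unknown sample first.
    remaining = sorted(
        (i for i in range(q) if i not in voters),
        key=lambda i: euclidean(points[i], query),
    )
    while True:
        counts = Counter(train_y[i] for i in voters).most_common()
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]          # strict majority found
        if not remaining:
            return None                  # assumption: tie left unresolved
        # Step 3: expand the voters with the next nearest neighbor (assumed one at a time).
        voters.append(remaining.pop(0))
```

In this sketch the number of voters plays the role of k: it starts as the degree of the unknown sample in the MST and grows only when the vote is tied, so no value of k has to be fixed in advance.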
