Robust Distance Measures for kNN Classification of Cancer Data

The k-Nearest Neighbor (kNN) classifier is a simple and very general approach to classification, yet its performance can often compete with that of more complex machine-learning algorithms. kNN rests on a "guilt by association" principle: a query is classified by measuring its similarity to a set of training patterns, typically computed as distances. The performance of kNN classifiers is closely linked to the choice of distance or similarity measure, and it is therefore relevant to investigate the effect of using different distance measures when comparing biomedical data. In this study of the classification of cancer data sets, we used both well-established and novel distance measures, including the Sobolev and Fisher distances, and evaluated the performance of kNN with these distances on four cancer data sets of different types. We find that the novel distance measures perform comparably to the more well-established ones, in particular the Sobolev distance. We define a robust ranking of all the distance measures according to overall performance. Several distance measures, notably the Hassanat, Sobolev, and Manhattan measures, perform robustly in kNN across several data sets. Some of the other measures perform well on selected data sets but appear more sensitive to the nature of the classification data. It is therefore important to benchmark distance measures on similar data prior to classification to identify the most suitable measure in each case.

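The abstract does not include code, but the approach it evaluates is easy to illustrate. The following is a minimal NumPy sketch, not the authors' implementation, of a kNN classifier with a pluggable distance function, using the Hassanat (2014) dimensionality-invariant measure as the example distance. The names hassanat_distance and knn_predict and the toy data below are our own illustrative choices.

```python
import numpy as np

def hassanat_distance(x, y):
    """Hassanat (2014) distance between two feature vectors.

    Per dimension: 1 - (1 + min) / (1 + max) when min(x_i, y_i) >= 0;
    when min is negative, both numerator and denominator are shifted
    by |min| so the ratio stays well defined and bounded.
    """
    lo = np.minimum(x, y)
    hi = np.maximum(x, y)
    shift = np.where(lo >= 0, 0.0, -lo)  # |min| when min is negative, else 0
    return np.sum(1.0 - (1.0 + lo + shift) / (1.0 + hi + shift))

def knn_predict(X_train, y_train, query, k=5, dist=hassanat_distance):
    """Classify `query` by majority vote among its k nearest training points."""
    d = np.array([dist(query, row) for row in X_train])
    nearest = np.argsort(d)[:k]          # indices of the k smallest distances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two Gaussian blobs standing in for two diagnostic classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, rng.normal(3, 1, 5), k=5))  # typically prints 1
```

Swapping in another measure (Manhattan, Sobolev, etc.) only requires passing a different callable as `dist`, which is how a benchmark over several distance measures, as advocated above, could be organized.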