Robust Distance Measures for kNN Classification of Cancer Data

The k-Nearest Neighbor (kNN) classifier is a simple and very general approach to classification, yet its performance can often compete with that of more complex machine-learning algorithms. kNN rests on a "guilt by association" principle: a query is classified by measuring its similarity to a set of training patterns, typically computed as distances. The performance of kNN classifiers is closely linked to the choice of distance or similarity measure, and it is therefore relevant to investigate the effect of using different distance measures when comparing biomedical data. In this study of the classification of cancer data sets, we used both well-established and novel distance measures, including the Sobolev and Fisher distances, and evaluated the performance of kNN with these distances on four cancer data sets of different types. We find that the novel distance measures perform comparably to the more well-established ones, in particular the Sobolev distance. We define a robust ranking of all the distance measures according to overall performance. Several distance measures, notably the Hassanat, Sobolev, and Manhattan measures, perform robustly in kNN across several data sets. Some of the other measures perform well on selected data sets but appear more sensitive to the nature of the classification data. It is therefore important to benchmark distance measures on similar data prior to classification to identify the most suitable measure in each case.

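The abstract does not include code, but the approach it evaluates is easy to illustrate. The following is a minimal NumPy sketch, not the authors' implementation, of a kNN classifier with a pluggable distance function, using the Hassanat (2014) dimensionality-invariant measure as the example distance. The names hassanat_distance and knn_predict and the toy data below are our own illustrative choices.

```python
import numpy as np

def hassanat_distance(x, y):
    """Hassanat (2014) distance between two feature vectors.

    Per dimension: 1 - (1 + min) / (1 + max) when min(x_i, y_i) >= 0;
    when min is negative, both numerator and denominator are shifted
    by |min| so the ratio stays well defined and bounded.
    """
    lo = np.minimum(x, y)
    hi = np.maximum(x, y)
    shift = np.where(lo >= 0, 0.0, -lo)  # |min| when min is negative, else 0
    return np.sum(1.0 - (1.0 + lo + shift) / (1.0 + hi + shift))

def knn_predict(X_train, y_train, query, k=5, dist=hassanat_distance):
    """Classify `query` by majority vote among its k nearest training points."""
    d = np.array([dist(query, row) for row in X_train])
    nearest = np.argsort(d)[:k]          # indices of the k smallest distances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two Gaussian blobs standing in for two diagnostic classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, rng.normal(3, 1, 5), k=5))  # typically prints 1
```

Swapping in another measure (Manhattan, Sobolev, etc.) only requires passing a different callable as `dist`, which is how a benchmark over several distance measures, as advocated above, could be organized.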