A Comparative Study of Distance Metric Learning to Find Sub-categories of Minority Class from Imbalance Data

The class imbalance problem in data mining occurs when one class, known as the minority class, has significantly fewer samples than the other class(es), known as the majority class(es). It degrades the performance of machine learning algorithms by biasing them towards the majority class. This bias is attributed to sub-concepts within the minority class. Recent studies have further divided the minority class into four sub-concepts, Safe, Borderline, Rare and Outlier, using the majority-minority proportion in the neighborhood of every minority sample. Among these sub-concepts, Safe examples are easy to identify, while classifiers become increasingly inaccurate on the subsequent sub-categories (Borderline, Rare, Outlier). Some recent studies use the heterogeneous value difference metric (HVDM) as the distance measure for categorizing the data. However, the effects of numerous other distance metrics on determining the sub-concepts have not yet been explored. This research evaluates the effects of different distance metrics on the identification of the sub-concepts within minority class data. We consider ten datasets and five distance metrics. The datasets are divided into three categories: all categorical, mixed, and fully numeric. For datasets with more categorical attributes, the outputs differ greatly between the distance functions. In those cases, the Euclidean and Manhattan distance functions yield relatively safer examples. Our study shows that, for categorizing minority data, the distance metric should be chosen on a per-dataset basis.
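The neighborhood-based categorization described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, the choice of k = 5, and the thresholds (4 to 5 same-class neighbors for Safe, 2 to 3 for Borderline, 1 for Rare, 0 for Outlier) are assumptions made for illustration, since the abstract does not state them. Only the Euclidean and Manhattan distances on numeric data are shown; HVDM and categorical attributes are left out.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance between two numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def categorize_minority(X, y, minority_label, distance, k=5):
    """Return {sample index: category} for every minority sample.

    The category depends on how many of the k nearest neighbours
    (under `distance`) also belong to the minority class.
    Thresholds are an assumption, expressed as fractions of k:
    >= 4/5 safe, >= 2/5 borderline, >= 1/5 rare, otherwise outlier.
    """
    categories = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Rank all other samples by distance and keep the k closest.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: distance(xi, X[j]),
        )[:k]
        same = sum(1 for j in neighbours if y[j] == minority_label)
        # Integer comparisons avoid floating-point ratio issues.
        if same * 5 >= 4 * k:
            categories[i] = "safe"
        elif same * 5 >= 2 * k:
            categories[i] = "borderline"
        elif same * 5 >= 1 * k:
            categories[i] = "rare"
        else:
            categories[i] = "outlier"
    return categories


# Hypothetical toy data: a tight minority cluster near the origin and
# one minority point surrounded by majority samples.
X = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.5, 0),   # minority cluster
    (10, 10),                                               # isolated minority point
    (10, 11), (11, 10), (9, 10), (10, 9), (11, 11),         # majority
]
y = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for name, metric in (("euclidean", euclidean), ("manhattan", manhattan)):
    print(name, categorize_minority(X, y, minority_label=1, distance=metric))
```

On this well-separated numeric toy set the two metrics agree. Passing the distance as a parameter makes it easy to swap in other metrics and compare the resulting categories on a given dataset, which is the kind of comparison the study describes.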
