Automated Determination of the Input Parameter of DBSCAN Based on Outlier Detection

Over the last two decades, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has been one of the most widely used and most highly cited clustering algorithms. Despite its strengths, DBSCAN has a shortcoming in parameter selection: its input parameter is chosen interactively by the user, who must inspect a graphical representation of the data. This paper introduces a simple and effective method for determining the input parameter of DBSCAN automatically. The idea is based on a statistical technique for outlier detection, namely the empirical rule. This work also suggests a more accurate method for detecting clusters that lie close to each other. Experimental comparisons with the original method, together with a time complexity equal to that of the original algorithm, indicate that the proposed method is able to determine the input parameter of DBSCAN reliably and efficiently.
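The abstract does not say which quantity the empirical rule is applied to, so the following is only an illustrative sketch under stated assumptions, not the procedure of this paper. It assumes that the input parameter in question is the neighborhood radius Eps, that the rule is applied to each point's distance to its k-th nearest neighbor (the quantity plotted in the interactive k-dist graph of the original DBSCAN paper), and that distances more than three standard deviations above the mean count as outliers. The function names `kth_neighbor_distances`, `empirical_rule_cutoff`, and `estimate_eps` are hypothetical.

```python
import math
import statistics


def kth_neighbor_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force, O(n^2 log n))."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def empirical_rule_cutoff(values, n_sigmas=3.0):
    """Upper bound mean + n_sigmas * std; larger values are treated as outliers."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean + n_sigmas * std


def estimate_eps(points, k=4, n_sigmas=3.0):
    """Assumed heuristic: Eps is the largest k-dist value that is not an outlier."""
    kdist = kth_neighbor_distances(points, k)
    cutoff = empirical_rule_cutoff(kdist, n_sigmas)
    inliers = [d for d in kdist if d <= cutoff]
    return max(inliers)


if __name__ == "__main__":
    # A dense 5x5 grid with unit spacing plus one distant noise point.
    grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
    data = grid + [(100.0, 100.0)]
    print(estimate_eps(data, k=4))
```

The choice k = 4 follows the default suggested for two-dimensional data in the original DBSCAN paper; the three-sigma threshold is one common reading of the empirical rule, and the paper may use a different multiple or a different statistic.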
