Discovering Local Outliers using Dynamic Minimum Spanning Tree with Self-Detection of Best Number of Clusters

Detecting outliers (unusual objects) in a database by combining clustering with a distance-based approach is highly desirable. Minimum spanning tree (MST) based clustering algorithms are capable of detecting clusters with irregular boundaries. In this paper we propose a new algorithm that detects outliers using minimum spanning tree clustering together with a distance-based approach. Outlier detection is an extremely important task in a wide variety of applications. The algorithm partitions the dataset into an optimal number of clusters. Small clusters are then identified and treated as outliers. The remaining outliers (if any) are then detected within the clusters using a distance-based method. To find the proper number of clusters, the algorithm uses a new cluster validation criterion based on the geometric properties of the partition of the dataset. The algorithm works in two phases: the first phase creates the optimal number of clusters, whereas the second phase detects outliers within those clusters. The key
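The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the cluster validation criterion or the distance-based rule, so the sketch assumes the number of clusters `k` is supplied by the caller, and it uses a hypothetical rule that flags a point when its distance to the cluster centroid exceeds `factor` times the cluster's mean distance. The names `prim_mst`, `mst_clusters`, `detect_outliers`, `min_size`, and `factor` are all illustrative.

```python
import math
from collections import defaultdict


def prim_mst(points):
    """Return the MST edges (weight, i, j) of the complete Euclidean graph (Prim)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def mst_clusters(points, k):
    """Phase 1 (assumed form): drop the k-1 longest MST edges to get k clusters."""
    edges = sorted(prim_mst(points))
    kept = edges[: len(edges) - (k - 1)]
    root = list(range(len(points)))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    return list(groups.values())


def detect_outliers(points, k, min_size=3, factor=2.0):
    """Phase 2 (assumed form): small clusters, then far-from-centroid points."""
    outliers = set()
    dim = len(points[0])
    for members in mst_clusters(points, k):
        if len(members) < min_size:
            # Small clusters are treated as outliers outright.
            outliers.update(members)
            continue
        centroid = tuple(
            sum(points[i][d] for i in members) / len(members) for d in range(dim)
        )
        dists = {i: math.dist(points[i], centroid) for i in members}
        mean_d = sum(dists.values()) / len(dists)
        # Hypothetical distance rule; the paper's actual rule is not given here.
        outliers.update(i for i, d in dists.items() if d > factor * mean_d)
    return sorted(outliers)
```

Calling `detect_outliers(points, k)` returns the sorted indices of flagged points. Supplying `k` by hand stands in for the self-detection of the number of clusters that the paper proposes, which would replace that argument in a faithful implementation.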
