Comparative Analysis of Density-Based Outlier Detection Techniques on Breast Cancer Data Using Hadoop and MapReduce

Advances in technology have left companies with terabytes of data, and extracting useful information from data at this scale, the task of data mining, has become a pressing need. Anomaly detection [8] refers in this context to identifying data objects that do not conform to a notion of normal data objects. Various density-based clustering algorithms [10] categorize data objects as normal or anomalous by finding clusters within the data set. LOF [18] finds anomalous data objects by comparing the local density of each data object with the local densities of its neighbors. DBSCAN treats as anomalous those data objects that lie in low-density regions, far from the dense groups of data objects that form clusters. OPTICS, an extension of DBSCAN, finds clusters of arbitrary sizes. DENCLUE uses a set of density distribution functions. This paper compares the density-based algorithms LOF, OPTICS, DBSCAN, and DENCLUE on parameters such as time taken on a single Hadoop cluster, noise detection accuracy, number of anomalous instances detected on high-dimensional data, ability to handle varied density, input parameters, and complexity.
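As an illustration of the LOF idea summarized above, the following is a minimal single-machine sketch in plain Python. It is not the paper's Hadoop implementation. The function names (`k_nearest`, `lof_scores`), the choice of k = 3, and the toy points are all assumptions made for this example. It also simplifies the original definition by taking exactly k neighbors (ties broken by index) and assumes the points are distinct.

```python
import math

def k_nearest(points, i, k):
    """Return the k nearest (distance, index) pairs for points[i]."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    return dists[:k]

def lof_scores(points, k=3):
    """Compute a Local Outlier Factor score for every point.

    Scores near 1 mean a point is about as dense as its neighbors;
    scores well above 1 mean it sits in a sparser region than they do.
    """
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor
    k_dist = [nb[-1][0] for nb in neighbors]

    # Local reachability density: inverse of the mean reachability
    # distance from a point to its neighbors, where
    # reach_dist(p, o) = max(k_dist(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean neighbor density divided by the point's own density
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Hypothetical toy data: a tight group of five points and one distant point
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
for point, score in zip(points, lof_scores(points, k=3)):
    print(point, round(score, 2))
```

On this toy data the distant point receives a score far above 1 while the grouped points stay close to 1, which is the contrast LOF relies on to separate anomalous from normal data objects.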

[1]  Hans-Peter Kriegel, et al. OPTICS: ordering points to identify the clustering structure, 1999, SIGMOD '99.

[2]  A. Raftery, et al. Nearest-Neighbor Clutter Removal for Estimating Features in Spatial Point Processes, 1998.

[3]  Dr. Chandra, et al. A Survey on Clustering Algorithms for Data in Spatial Database Management Systems, 2011.

[4]  Houkuan Huang, et al. A Grid-Based Clustering Algorithm for Network Anomaly Detection, 2007, The First International Symposium on Data, Privacy, and E-Commerce (ISDPE 2007).

[5]  E. Chandra, et al. A Survey on Clustering Algorithms for Data in Spatial Database Management Systems, 2011.

[7]  Peng Liu, et al. VDBSCAN: Varied Density Based Spatial Clustering of Applications with Noise, 2007, 2007 International Conference on Service Systems and Service Management.

[8]  M. Punithavalli, et al. Improved varied density based spatial clustering algorithm with noise, 2010, 2010 IEEE International Conference on Computational Intelligence and Computing Research.

[9]  Peide Liu. Research on Risk Evaluation for Venture Capital Based on Intuitionistic Fuzzy Set and TOPSIS, 2007, The First International Symposium on Data, Privacy, and E-Commerce (ISDPE 2007).

[10]  Karanjit Singh, et al. Nearest Neighbour Based Outlier Detection Techniques, 2012.

[11]  Varun Chandola, et al. Anomaly detection: A survey, 2009, CSUR.

[12]  Hans-Peter Kriegel, et al. LOF: identifying density-based local outliers, 2000, SIGMOD '00.

[13]  Anand Singh Jalal, et al. A Density Based Algorithm for Discovering Density Varied Clusters in Large Spatial Databases, 2010.

[14]  H. Edelsbrunner, et al. Efficient algorithms for agglomerative hierarchical clustering methods, 1984.

[15]  Sergio M. Savaresi, et al. Cluster Selection in Divisive Clustering Algorithms, 2002, SDM.

[16]  M. Parimala, et al. A Survey on Density Based Clustering Algorithms for Mining Large Spatial Databases, 2011.

[17]  Hans-Peter Kriegel, et al. OPTICS-OF: Identifying Local Outliers, 1999, PKDD.

[18]  Hitesh Gupta, et al. A Review of Density-Based clustering in Spatial Data, 2012.

[19]  Sridhar Ramaswamy, et al. Efficient algorithms for mining outliers from large data sets, 2000, SIGMOD '00.

[20]  Ingrid Russell, et al. An introduction to the WEKA data mining system, 2006, ITICSE '06.