A genetic approach for efficient outlier detection in projected space

In this paper we present a genetic solution to the outlier detection problem. The essential idea behind this technique is to define outliers by examining those projections of the data, along which the data points have abnormal or inconsistent behavior (defined in terms of their sparsity values). We use a partitioning method to divide the data set into groups such that all the objects in a group can be considered to behave similarly. We then identify those groups that contain outliers. The algorithm assigns an 'outlier-ness' value that gives a relative measure of how strong an outlier group is. An evolutionary search computation technique is employed for determining those projections of the data over which the outliers can be identified. A new data structure, called the grid count tree (GCT), is used for efficient computation of the sparsity factor. GCT helps in quickly determining the number of points within any grid defined over the projected space and hence facilitates faster computation of the sparsity factor. A new crossover is also defined for this purpose. The proposed method is applicable for both numeric and categorical attributes. The search complexity of the GCT traversal algorithm is provided. Results are demonstrated for both artificial and real life data sets including four gene expression data sets.

[1]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[2]  Wan Weishui,et al.  Convergence theorem of genetic algorithm , 1996, 1996 IEEE International Conference on Systems, Man and Cybernetics. Information Intelligence and Systems (Cat. No.96CH35929).

[3]  Nong Ye,et al.  The Handbook of Data Mining , 2003 .

[4]  Alexander S. Szalay,et al.  Very Fast Outlier Detection in Large Multidimensional Data Sets , 2002, DMKD.

[5]  K. Dejong,et al.  An analysis of the behavior of a class of genetic adaptive systems , 1975 .

[6]  Prabhakar Raghavan,et al.  A Linear Method for Deviation Detection in Large Databases , 1996, KDD.

[7]  Zbigniew Michalewicz,et al.  Genetic Algorithms + Data Structures = Evolution Programs , 1996, Springer Berlin Heidelberg.

[8]  H P Friedman,et al.  Searching for evidence of altered gene expression: a comment on statistical analysis of microarray data. , 1999, Journal of the National Cancer Institute.

[9]  Peter Ross,et al.  Small Populations over Many Generations can beat Large Populations over Few Generations in Genetic P , 1997 .

[10]  Sudipto Guha,et al.  ROCK: A Robust Clustering Algorithm for Categorical Attributes , 2000, Inf. Syst..

[11]  Hans-Peter Kriegel,et al.  OPTICS-OF: Identifying Local Outliers , 1999, PKDD.

[12]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[13]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[14]  Dr. Alex A. Freitas Data Mining and Knowledge Discovery with Evolutionary Algorithms , 2002, Natural Computing Series.

[15]  Vic Barnett,et al.  Outliers in Statistical Data , 1980 .

[16]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[17]  Kenneth Alan De Jong,et al.  An analysis of the behavior of a class of genetic adaptive systems. , 1975 .

[18]  Aidong Zhang,et al.  Cluster analysis for gene expression data: a survey , 2004, IEEE Transactions on Knowledge and Data Engineering.

[19]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[20]  Zbigniew Michalewicz,et al.  Genetic algorithms + data structures = evolution programs (3rd ed.) , 1996 .

[21]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Researchers , 2007 .

[22]  Zengyou He,et al.  Discovering cluster-based local outliers , 2003, Pattern Recognit. Lett..

[23]  R. Suganya,et al.  Data Mining Concepts and Techniques , 2010 .

[24]  Chao Yan,et al.  Outlier analysis for gene expression data , 2008, Journal of Computer Science and Technology.

[25]  Huan Liu,et al.  Subspace clustering for high dimensional data: a review , 2004, SKDD.

[26]  Charu C. Aggarwal,et al.  Optimized Crossover for the Independent Set Problem , 1997, Oper. Res..

[27]  Anthony K. H. Tung,et al.  Mining top-n local outliers in large databases , 2001, KDD '01.

[28]  Stefan Berchtold,et al.  Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets , 2003, IEEE Trans. Knowl. Data Eng..

[29]  Douglas M. Hawkins Identification of Outliers , 1980, Monographs on Applied Probability and Statistics.

[30]  M. Goldstein,et al.  Analysis of Gene Expression Data , 2022 .

[31]  Raymond T. Ng,et al.  Finding Intensional Knowledge of Distance-Based Outliers , 1999, VLDB.

[32]  Theodore Johnson,et al.  Range selectivity estimation for continuous attributes , 1999, Proceedings. Eleventh International Conference on Scientific and Statistical Database Management.