Online Clustering and Outlier Detection

Clustering and outlier detection are important areas of data mining. Online clustering and outlier detection typically operate on continuous data streams generated at a rapid rate, and have many practical applications, such as network intrusion detection and online fraud detection. This chapter first reviews the background on online clustering and outlier detection. It then proposes an incremental clustering and outlier detection method for market-basket data and presents it in detail. The proposed method consists of two phases: weighted affinity measure clustering (WC clustering) and outlier detection. Given a data set, the WC clustering phase analyzes it and groups data items into clusters. The outlier detection phase then examines each newly arrived transaction against the item clusters formed in the WC clustering phase and determines whether the transaction is an outlier. Periodically, the newly collected transactions are analyzed with WC clustering to produce an updated set of clusters, against which transactions that arrive afterwards are examined. The process is carried out continuously and incrementally. Finally, the chapter explores future research trends in online data mining.

DOI: 10.4018/978-1-4666-2455-9.ch008
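The two-phase loop described in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's method: the abstract does not define the weighted affinity measure, so plain Jaccard similarity between items stands in for it, and the names `cluster_items`, `is_outlier`, `monitor`, and the parameters `min_affinity`, `min_coverage`, and `batch_size` are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def cluster_items(transactions, min_affinity=0.5):
    """Phase 1: group items that frequently co-occur.

    Assumption: affinity is Jaccard similarity over the transactions that
    contain each item, a stand-in for the chapter's weighted measure.
    """
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        items = sorted(set(t))
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    # Union-find: link every pair of items whose affinity is high enough.
    parent = {item: item for item in item_count}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), both in pair_count.items():
        either = item_count[a] + item_count[b] - both
        if both / either >= min_affinity:
            parent[find(a)] = find(b)

    clusters = {}
    for item in item_count:
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


def is_outlier(transaction, clusters, min_coverage=0.5):
    """Phase 2: flag a transaction that no single cluster covers well.

    Assumption: coverage is the share of the transaction's items that fall
    in its best-matching cluster.
    """
    items = set(transaction)
    if not items:
        return False
    best = max((len(items & c) for c in clusters), default=0)
    return best / len(items) < min_coverage


def monitor(stream, initial, batch_size=100):
    """Check each arriving transaction, re-clustering after every batch."""
    clusters = cluster_items(initial)
    buffer = []
    for t in stream:
        yield t, is_outlier(t, clusters)
        buffer.append(t)
        if len(buffer) >= batch_size:
            # Rebuild clusters from the newly collected transactions only.
            clusters = cluster_items(buffer)
            buffer = []


initial = [
    ["bread", "milk"],
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["beer", "chips"],
    ["beer", "chips"],
]
stream = [["bread", "milk"], ["bread", "beer", "eggs"], ["bread", "beer", "eggs"]]
flags = [flag for _, flag in monitor(stream, initial, batch_size=2)]
```

In this toy run the second transaction is flagged because its items are spread across three clusters, while the same basket arriving after the periodic re-clustering is no longer flagged. That mirrors the incremental behaviour the abstract describes; a real deployment would need the chapter's actual affinity measure and thresholds.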
