DynamicWEB: a conceptual clustering algorithm for a changing world

This research was motivated by problems in network security, where an attacker often deliberately changes identifying information and behaviour in order to camouflage malicious activity. Addressing this problem has resulted in a new adaptation of the unsupervised machine learning technique COBWEB. In machine learning and data mining, the aim is to extract patterns from data in order to discover the meaning underlying the processes that are taking place. In most cases each object is observed once, and the extracted patterns are then used to classify newly observed objects. Conceptual clustering aims to do this in such a way that the learned patterns are human readable. Concept drift algorithms allow concepts to change over time, although most do so in a supervised manner, which presents a challenge when looking for novel classes.

This research focuses on the classification of objects that change over time across multiple observations. An object may change its own characteristics (labelled object drift in this research) or keep the same characteristics but change its identifier. It is also possible for the concept that describes a group of objects to change (known as concept drift), so that the definition of that concept differs over time. The ideas of concept drift and object drift are not only relevant within the computer security field but can be of significance in any knowledge domain, so any method addressing this learning problem should be general enough to apply in many application areas. Accordingly, in addition to its possible application within the security domain, the method was generalised and tested across a range of machine learning and data mining domains, and in the process was shown to be robust in the presence of concept drift.

The new method, entitled DynamicWEB, extends the existing conceptual clustering method COBWEB to allow profiles to be added to and removed from the concept hierarchy. An index structure implemented using an AVL tree facilitates fast, scalable searching of the knowledge structure. As the target objects change over time, the profile of each target is updated within the structure, maintaining an up-to-date representation of the domain. Each profile contains derived attributes, formed across multiple observations of the object, with the aim of retaining knowledge of how the object has changed over time. As well as preserving context over time, DynamicWEB uses multiple trees, transforming the learner into an ensemble classifier.

In addition to the security and network-based datasets, a number of other datasets were examined. A new dataset (a modified version of Quinlan's weather dataset) is presented to illustrate how DynamicWEB operates in the presence of object drift. The method is also tested on several well-known machine learning datasets, some of which exhibit concept drift. Alongside these artificial datasets, a group of real-world datasets, including several sourced from the Australian Bureau of Statistics, were also examined, illustrating DynamicWEB's ability to adapt to change. This thesis describes the work done to enable DynamicWEB to adapt to both concept drift and object drift, both of which are characteristic of many application domains.
DynamicWEB is also capable of profiling an object across multiple observations to allow for accurate prediction and inter-object relationship discovery.
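To make the update cycle described above more concrete, the following Java sketch illustrates one plausible reading of it: an ordered index keyed by object identifier holds a profile of derived attributes for each target, and each new observation removes that profile from the concept hierarchy, folds in the new values, and re-inserts it. This is a minimal illustration under stated assumptions, not the thesis implementation: the class and method names (DynamicWebIndex, Profile, ConceptHierarchy, observe) are hypothetical, java.util.TreeMap (a red-black tree) stands in for the AVL-tree index described in the abstract, and the COBWEB hierarchy itself is reduced to a stub.

import java.util.Map;
import java.util.TreeMap;

// Placeholder for a COBWEB-style concept hierarchy: profiles are removed before
// being modified and re-classified down the tree once their attributes change.
interface ConceptHierarchy {
    void remove(Profile p);
    void insert(Profile p);
}

// A profile accumulates derived attributes across multiple observations of one object.
class Profile {
    final String id;
    int observations = 0;
    double lastValue = Double.NaN;
    double runningMean = 0.0;   // derived attribute: mean across observations
    double lastDelta = 0.0;     // derived attribute: change since the previous observation

    Profile(String id) { this.id = id; }

    void addObservation(double value) {
        observations++;
        lastDelta = Double.isNaN(lastValue) ? 0.0 : value - lastValue;
        runningMean += (value - runningMean) / observations;  // incremental mean
        lastValue = value;
    }
}

// Ordered index from object identifier to profile; TreeMap stands in for the AVL tree.
class DynamicWebIndex {
    private final Map<String, Profile> index = new TreeMap<>();
    private final ConceptHierarchy hierarchy;

    DynamicWebIndex(ConceptHierarchy hierarchy) { this.hierarchy = hierarchy; }

    // Fold a new observation of an object into its profile and the concept hierarchy.
    void observe(String id, double value) {
        Profile p = index.get(id);
        if (p == null) {                 // first sighting: create a profile
            p = new Profile(id);
            index.put(id, p);
        } else {
            hierarchy.remove(p);         // take the existing profile out of the tree
        }
        p.addObservation(value);         // update derived attributes over time
        hierarchy.insert(p);             // re-insert so the hierarchy stays current
    }
}

public class DynamicWebSketch {
    public static void main(String[] args) {
        ConceptHierarchy stub = new ConceptHierarchy() {   // stand-in for a real COBWEB tree
            public void remove(Profile p) { System.out.println("remove " + p.id); }
            public void insert(Profile p) { System.out.println("insert " + p.id + " mean=" + p.runningMean); }
        };
        DynamicWebIndex index = new DynamicWebIndex(stub);
        index.observe("host-A", 10.0);   // first observation of a hypothetical target
        index.observe("host-A", 14.0);   // later, drifted observation: profile updated in place
    }
}

The design point the sketch is meant to convey is the remove-update-reinsert cycle: because the profile, not the raw observation, is what lives in the hierarchy, the structure can stay current as objects drift without reclustering from scratch.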
