Information Feature Selection: Using Local Attribute Selections to Represent Connected Distributions in Complex Datasets

Clustering algorithms such as k-means, BIRCH, CLARANS and DBSCAN are designed to be scalable and to discover clusters in the full dimensional space of a database. Nevertheless, their performance depends on the size of the database: a database or data warehouse may store terabytes of data, and complex analysis (mining) of such data may take a very long time to run. To accelerate processing, one must obtain a reduced representation of the dataset that is much smaller in volume yet produces the same, or almost the same, analytical results. Reduced representations yield simplified models that are easier to interpret, avoid the curse of dimensionality, and improve generalization by reducing overfitting. Data reduction methods include data cube aggregation, attribute subset selection, fitting data to models, dimensionality reduction, and concept hierarchies, among others.

Feature selection can be regarded as a special case of the more general paradigm of structure learning, applied when an outcome is associated with a set of attributes. It aims to select a minimal set of features such that the probability distribution of the classes given the values of those features is as close as possible to the distribution given the values of all features.

This paper proposes a combined approach that represents complex datasets in a database as a minimal set of connected attribute sets of reduced dimension. Value-Difference (VD) metrics defined over binary, categorical and continuous values are used for subspace clustering, and each cluster is represented by a different set of object features/attributes, chosen to maximize the information rendered by the cluster representation. Numerical results from a test-bed system for anomaly detection illustrate the approach.
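The abstract does not spell out the exact VD-metric formulation. As an illustration only, the classic Value Difference Metric of Stanfill and Waltz, on which VD metrics for categorical attributes are commonly based, compares two attribute values by the difference of their conditional class distributions. A minimal sketch in Python, assuming categorical records and labels seen at fit time (the function names and the exponent q=2 are illustrative assumptions, not the paper's method):

```python
from collections import Counter, defaultdict

def fit_vdm(X, y, q=2):
    """Estimate per-attribute conditional class probabilities P(c | a = v)
    from categorical records X (list of tuples) and class labels y."""
    n_attrs = len(X[0])
    # counts[a][v][c] = number of records with attribute a equal to v and class c
    counts = [defaultdict(Counter) for _ in range(n_attrs)]
    for row, label in zip(X, y):
        for a, v in enumerate(row):
            counts[a][v][label] += 1
    classes = sorted(set(y))
    # probs[a][v] = vector of P(c | a = v) over all classes
    probs = [
        {v: [cnt[c] / sum(cnt.values()) for c in classes]
         for v, cnt in counts[a].items()}
        for a in range(n_attrs)
    ]
    return probs, q

def vdm_distance(probs, q, u, w):
    """Value-Difference distance between records u and w:
    sum over attributes a of sum_c |P(c | u_a) - P(c | w_a)|^q.
    Assumes every attribute value of u and w was seen during fitting."""
    dist = 0.0
    for a, (x, z) in enumerate(zip(u, w)):
        px, pz = probs[a][x], probs[a][z]
        dist += sum(abs(p1 - p2) ** q for p1, p2 in zip(px, pz))
    return dist
```

Because the distance between two values is driven by how differently they predict the class, the metric handles binary and categorical attributes uniformly; continuous attributes are typically discretized or handled with a separate normalized difference before being combined into one distance.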
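The selection criterion stated above (keeping the class distribution given the selected features close to the distribution given all features) is, in practice, usually approximated with mutual-information filters such as Battiti's MIFS or mRMR. A hedged sketch of such a greedy forward filter, assuming discrete or pre-discretized features and using scikit-learn's mutual_info_score; the function name and the parameters k and beta are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_mi_selection(X, y, k, beta=0.5):
    """Greedy forward selection in the spirit of Battiti's MIFS:
    at each step pick the feature with maximal relevance I(f; y),
    penalized by its redundancy with already-selected features.
    X is a 2-D array of discrete feature values, y the class labels."""
    n_features = X.shape[1]
    selected, remaining = [], list(range(n_features))
    # Relevance of each feature: mutual information with the class.
    relevance = [mutual_info_score(X[:, f], y) for f in range(n_features)]
    while remaining and len(selected) < k:
        def score(f):
            # Redundancy: summed mutual information with features chosen so far.
            redundancy = sum(mutual_info_score(X[:, f], X[:, s]) for s in selected)
            return relevance[f] - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, greedy_mi_selection(X, y, k=5) would return the indices of five features that are individually informative about the class while remaining mutually non-redundant; beta trades relevance against redundancy.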
