Mutual information algorithms for optimal attribute selection in data driven partitions of databases

Clustering algorithms like k-means, BIRCH, CLARANS and DBSCAN are designed to be scalable, and they are developed to discover clusters in the full dimensional space of a database. Nevertheless, their characteristics depend upon the size of the database. A database or data warehouse may store terabytes of data, so complex data analysis (mining) may take a very long time to run on such a dataset. In order to accelerate information processing, one has to obtain a reduced representation of the dataset that is much smaller in volume yet produces the same, or almost the same, analytical results. Reduced representations yield simplified models that are easier to interpret, avoid the curse of dimensionality and enhance generalization by reducing overfitting. Data reduction methods include data cube aggregation, attribute subset selection, fitting data into models, dimensionality reduction and concept hierarchies, as well as other approaches. On the other hand, data-dependent partitions, like Gessaman's partition and the tree-quantization partition, allow different partitions of a dataset to be processed separately; hence parallel processing becomes an option for big data. Online analytical processing is a practical approach that deals with multi-dimensional queries in database management. Feature selection may be regarded as a special case of a more general paradigm, called structure learning, whenever an outcome is associated with a set of attributes. Feature selection aims at selecting a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features. A mutual information approach based upon representing complex datasets in a database as a minimal set of coherent attribute sets of reduced dimensions is proposed herein. The novelty of the proposed approach consists in employing piecewise analysis of compact clusters in order to increase the overall Shannon mutual information (entropy), as a variant of conventional Classification and Regression Trees. Numerical data regarding a test-bed system for anomaly detection are provided in order to illustrate the aforementioned approach.
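The mutual-information criterion outlined above can be illustrated with a short, self-contained sketch. The following Python example is illustrative only and is not the paper's implementation: it discretizes continuous attributes into statistically equivalent blocks (equal-frequency bins, in the spirit of Gessaman's data-dependent partition), estimates Shannon mutual information from empirical joint frequencies, and then performs a greedy forward selection that trades relevance against redundancy, in the spirit of Battiti's MIFS criterion. All function names, the bin count and the redundancy weight beta are assumptions chosen for the example.

    import numpy as np

    def equal_frequency_bins(x, n_bins=8):
        # Split a continuous attribute into statistically equivalent
        # blocks (equal-frequency bins), echoing Gessaman's
        # data-dependent partition; returns integer bin indices.
        edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
        return np.searchsorted(edges, x)

    def mutual_information(x, y):
        # Empirical Shannon mutual information I(X;Y) in nats,
        # estimated from the joint frequency table of two discrete arrays.
        _, xi = np.unique(x, return_inverse=True)
        _, yi = np.unique(y, return_inverse=True)
        joint = np.zeros((xi.max() + 1, yi.max() + 1))
        np.add.at(joint, (xi, yi), 1.0)
        pxy = joint / joint.sum()
        px = pxy.sum(axis=1, keepdims=True)   # marginal of X
        py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
        nz = pxy > 0                          # skip zero cells to avoid log(0)
        return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

    def select_attributes(X, c, k, beta=0.5):
        # Greedy forward selection in the spirit of Battiti's MIFS:
        # pick the attribute maximizing relevance I(f;C) minus beta
        # times its redundancy with already selected attributes.
        remaining = set(range(X.shape[1]))
        selected = []
        while remaining and len(selected) < k:
            def score(f):
                relevance = mutual_information(X[:, f], c)
                redundancy = sum(mutual_information(X[:, f], X[:, s])
                                 for s in selected)
                return relevance - beta * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected

    # Usage: discretize each column of a continuous data matrix, then
    # pick the three attributes most informative about the labels c.
    rng = np.random.default_rng(0)
    X_raw = rng.normal(size=(500, 6))
    c = (X_raw[:, 0] + 0.5 * X_raw[:, 3] > 0).astype(int)
    X = np.column_stack([equal_frequency_bins(X_raw[:, j])
                         for j in range(X_raw.shape[1])])
    print(select_attributes(X, c, k=3))

Under this score a candidate attribute f is rated as I(f;C) minus beta times the sum of I(f;s) over the already selected attributes s, so attributes that merely duplicate information captured by the current selection are penalized rather than re-selected.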
