Empirical Data Analytics

In this paper, we propose an approach to data analysis that is based entirely on the empirical observations of discrete data samples and the relative proximity of these points in the data space. At the core of the proposed new approach is the typicality: an empirically derived quantity that resembles probability. This nonparametric measure is a normalized form of the square centrality (centrality is a measure of closeness used in graph theory). It is also closely linked to the cumulative proximity and eccentricity (a measure of the tails of distributions that is very useful for anomaly detection and the analysis of extreme values). In this paper, we introduce and study two types of typicality, namely its local and global versions. The local typicality resembles the well-known probability density function (pdf), probability mass function, and fuzzy set membership but differs from all of them. The global typicality, on the other hand, resembles well-known histograms but also differs from them. A distinctive feature of the proposed new approach, empirical data analysis (EDA), is that it is not limited by restrictive, impractical prior assumptions about the data generation model, as the traditional probability theory and statistical learning approaches are. Moreover, it does not require an explicit and binary assumption of either randomness or determinism of the empirically observed data, their independence, or even their number (which can be as low as a couple of data samples). The typicality is considered a fundamental quantity in pattern analysis, derived directly from the data and stated in a discrete form, in contrast to the traditional approach, where a continuous pdf is assumed a priori and estimated from data afterward. The typicality introduced in this paper is free from the paradoxes of the pdf. Typicality is objectivist, whereas fuzzy sets and the belief-based branch of probability theory are subjectivist.
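The abstract does not spell out the formulas, so the sketch below assumes the eccentricity-based definitions reported in the related TEDA/EDA literature: the cumulative proximity of a sample is the sum of its squared distances to all other samples, the eccentricity is the doubled proximity normalized by the total proximity, and the normalized typicality is the complement of the eccentricity rescaled to sum to one. The function names are illustrative, not part of the original text.

```python
import numpy as np

def cumulative_proximity(X):
    # pi(x_i) = sum_j d^2(x_i, x_j); squared Euclidean distance assumed here
    diffs = X[:, None, :] - X[None, :, :]
    return (diffs ** 2).sum(axis=2).sum(axis=1)

def eccentricity(X):
    # xi(x_i) = 2 * pi(x_i) / sum_j pi(x_j); by construction sums to 2
    pi = cumulative_proximity(X)
    return 2.0 * pi / pi.sum()

def normalized_typicality(X):
    # typicality as the complement of eccentricity, rescaled to sum to 1
    tau = 1.0 - eccentricity(X)
    return tau / tau.sum()
```

On a toy one-dimensional set with a single distant point, the distant point receives the largest eccentricity and the smallest typicality, which is the behavior the abstract alludes to for anomaly detection.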
The local typicality is expressed in a closed analytical form and can be calculated recursively and, thus, computationally very efficiently. The other nonparametric ensemble properties of the data introduced and studied in this paper, namely the square centrality, cumulative proximity, and eccentricity, can also be updated recursively for various types of distance metrics. Finally, a new type of classifier, called the naïve typicality-based EDA class, is introduced, which is based on the newly introduced global typicality. This is only one of a wide range of possible applications of EDA, including but not limited to anomaly detection, clustering, classification, control, prediction, rare-event analysis, etc., which will be the subject of further research.
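For the Euclidean distance, the recursive update mentioned above can be sketched as follows. The cumulative proximity of any point can be recovered exactly from two recursively maintained scalars-per-dimension quantities, the running mean and the running mean squared norm, via the algebraic identity pi_k(x) = k * (||x - mu_k||^2 + X_k - ||mu_k||^2). This is a minimal illustration consistent with the recursive forms in the EDA literature; the function names are assumptions of this sketch, not the paper's notation.

```python
import numpy as np

def update_stats(mu, xbar, k, x):
    # One recursive step: fold sample x into the running mean mu_k and
    # mean squared norm X_k. Costs O(d) time and memory per sample.
    x = np.asarray(x, dtype=float)
    if k == 0:
        return x.copy(), float(x @ x), 1
    k += 1
    mu = (k - 1) / k * mu + x / k
    xbar = (k - 1) / k * xbar + float(x @ x) / k
    return mu, xbar, k

def cumulative_proximity(x, mu, xbar, k):
    # pi_k(x) = sum_j ||x - x_j||^2, recovered WITHOUT storing the samples:
    # pi_k(x) = k * (||x - mu_k||^2 + X_k - ||mu_k||^2)
    x = np.asarray(x, dtype=float)
    return k * (float((x - mu) @ (x - mu)) + xbar - float(mu @ mu))
```

Because each update touches only the running mean and mean squared norm, the cost per sample stays O(d) no matter how many samples have been seen, which is what makes streaming use of these ensemble quantities feasible.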
