In this paper, a novel empirical data analysis approach (abbreviated as EDA) is introduced that is entirely data-driven and free from restrictive assumptions and pre-defined, problem- or user-specific parameters and thresholds. It is well known that traditional probability theory is restricted by strong prior assumptions that are often impractical and do not hold in real problems. Machine learning methods, on the other hand, are closer to real problems, but they usually rely on problem- or user-specific parameters or thresholds, making them more of an art than a science. In this paper we introduce a theoretically sound yet practically unrestricted and widely applicable approach that is based on the density of the data in the data space. Since the data may take exactly the same value multiple times, we distinguish between data points and unique locations in the data space. The number of data points, k, is greater than or equal to the number of unique locations, l, and at least one data point occupies each unique location. The number of data points that share exactly the same location in the data space (i.e. have equal value), f, can be seen as a frequency. By combining the spatial density and the frequency of occurrence of discrete data points, a new concept called multimodal typicality, τMM, is proposed in this paper. It offers a closed analytical form that represents ensemble properties derived entirely from the empirical observations of the data. Moreover, it is very close to (yet different from) histograms, the probability density function (pdf), and fuzzy set membership functions. Remarkably, no complicated pre-processing such as clustering is needed to obtain the multimodal representation. Furthermore, the closed form for the Euclidean and Mahalanobis distances, as well as for some other dissimilarity measures (e.g. cosine-based dissimilarity), can be expressed recursively, making the approach applicable to data streams and online algorithms. Inference/estimation of the typicality of data points that have not been observed so far can also be made. This new concept allows the very foundations of statistical and machine learning to be rethought and a series of anomaly detection, clustering, classification, prediction, control and other algorithms to be developed.
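The abstract does not state the closed forms, so the following is only a minimal illustrative sketch in Python of the general idea of combining spatial density with the frequency of repeated values, assuming a Cauchy-type unimodal density D(u) = 1 / (1 + ||u - μ||² / σ²) of the kind used in related empirical data analytics work; the function names, the toy data, and the recursive-update helper are illustrative assumptions, not the paper's definitions.

```python
import numpy as np


def multimodal_typicality(data):
    """Sketch of a frequency-weighted (multimodal) typicality estimate.

    data: array of shape (k, d) with k data points in d dimensions.
    Returns the unique locations, their frequencies f, and a
    typicality value per unique location normalised to sum to 1.
    """
    data = np.asarray(data, dtype=float)

    # Unique locations and their frequencies f (repeated values collapse).
    locations, freq = np.unique(data, axis=0, return_counts=True)

    # Global mean and mean squared scatter estimated from all k points
    # (Euclidean case, assumed form).
    mu = data.mean(axis=0)
    sigma2 = np.mean(np.sum((data - mu) ** 2, axis=1))

    # Assumed Cauchy-type local density at each unique location:
    # D(u) = 1 / (1 + ||u - mu||^2 / sigma^2)
    dist2 = np.sum((locations - mu) ** 2, axis=1)
    density = 1.0 / (1.0 + dist2 / sigma2)

    # Frequency-weighted density, normalised so the values sum to 1.
    weighted = freq * density
    tau_mm = weighted / weighted.sum()
    return locations, freq, tau_mm


def recursive_mean_scatter(mu_prev, X_prev, k, x_new):
    """Recursive update of the mean and mean squared norm after the k-th
    point, sketching the kind of update that makes such densities usable
    online for data streams (illustrative, not the paper's formulas)."""
    mu = (k - 1) / k * mu_prev + x_new / k
    X = (k - 1) / k * X_prev + np.dot(x_new, x_new) / k
    sigma2 = X - np.dot(mu, mu)
    return mu, X, sigma2


if __name__ == "__main__":
    # Toy data with repeated values to illustrate the frequency weighting.
    x = np.array([[1.0], [1.0], [1.2], [3.0], [3.0], [3.0], [7.5]])
    locs, f, tau = multimodal_typicality(x)
    for u, fi, t in zip(locs, f, tau):
        print(f"location {u}, frequency {fi}, typicality {t:.3f}")
```

In this sketch the repeated values at 1.0 and 3.0 receive larger typicality than the isolated point at 7.5, both because they lie nearer the bulk of the data and because their frequency is higher, which is the qualitative behaviour the multimodal typicality is described as capturing.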