The Coding Divergence for Measuring the Complexity of Separating Two Sets

In this paper we integrate two essential processes, the discretization of continuous data and the learning of a model that explains them, toward fully computational machine learning from continuous data. Discretization is fundamental for machine learning and data mining, since every continuous datum, e.g., a real-valued measurement observed in the real world, must be converted from analog (continuous) to digital (discrete) form before it can be stored in a database. However, most machine learning methods ignore this situation: they operate on digital data in actual applications on a computer while assuming analog data (usually vectors of real numbers) in theory. To bridge this gap, we propose a novel measure of the difference between two sets of data, called the coding divergence, and computationally unify the two processes of discretization and learning. Discretization of continuous data is realized by a topological mapping (in the mathematical sense) from the d-dimensional Euclidean space R^d into the Cantor space Σ^ω, and the simplest model, which corresponds to the minimum open set separating the two given sets of data, is learned in the Cantor space. Furthermore, we construct a classifier using the divergence and experimentally demonstrate its robust performance. Our contribution is not only to introduce a new measure from the computational point of view, but also to trigger more interaction between experimental science and machine learning.

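To make the construction concrete, the following is a minimal Python sketch of the idea under simplifying assumptions: each coordinate is assumed to lie in [0, 1), the embedding into the Cantor space Σ^ω is approximated by interleaving truncated binary expansions of the coordinates, and the divergence is taken to be the mean length of the shortest cylinder (binary prefix) needed to separate each point of one set from the other. All names (`encode`, `shortest_separating_prefix`, `coding_divergence`, `classify`) are hypothetical, and the classification rule shown is an illustrative choice, not necessarily the paper's exact definition.

```python
def encode(x, level):
    """Map a vector x in [0, 1)^d to a binary string by interleaving the
    binary expansions of its coordinates, one digit per dimension per
    round. This truncates, to a finite prefix, an embedding of the unit
    cube into the Cantor space of infinite binary sequences."""
    bits, vals = [], list(x)
    for _ in range(level):
        for i, v in enumerate(vals):
            v *= 2.0
            bit = 1 if v >= 1.0 else 0
            bits.append(str(bit))
            vals[i] = v - bit
    return "".join(bits)

def shortest_separating_prefix(code, other_codes):
    """Shortest prefix of `code` that prefixes no code in `other_codes`.
    Such a prefix names a basic open set (a cylinder) in Cantor space
    that contains the point but none of the opposing points."""
    for k in range(1, len(code) + 1):
        p = code[:k]
        if not any(c.startswith(p) for c in other_codes):
            return p
    raise ValueError("indistinguishable codes; raise the encoding level")

def coding_divergence(X, Y, level=16):
    """Illustrative one-sided divergence: the mean length of the shortest
    prefixes needed to carve X out of an open set disjoint from Y. The
    union of these cylinders plays the role of the minimum open set
    separating X from Y described in the abstract."""
    Yc = [encode(y, level) for y in Y]
    lengths = [len(shortest_separating_prefix(encode(x, level), Yc))
               for x in X]
    return sum(lengths) / len(lengths)

def classify(z, X, Y, level=16):
    """Toy classifier: assign z to the class from which the opposing
    class remains cheapest to separate once z has joined it."""
    dx = coding_divergence(X + [z], Y, level)
    dy = coding_divergence(Y + [z], X, level)
    return "X" if dx <= dy else "Y"

# Two well-separated clusters in the unit square.
X = [(0.1, 0.2), (0.15, 0.25), (0.2, 0.1)]
Y = [(0.8, 0.9), (0.85, 0.8), (0.9, 0.85)]
print(coding_divergence(X, Y))       # small: a few bits suffice
print(classify((0.12, 0.18), X, Y))  # -> "X"
```

The length of the separating prefix grows with how finely the two sets interleave in space, which is why the mean prefix length behaves as a complexity of separation: well-separated sets need only a few bits, while overlapping ones need many.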