Empirical Risk Approximation: An Induction Principle for Unsupervised Learning
(Schwerpunktprogramm der Deutschen Forschungsgemeinschaft "Echtzeitoptimierung grosser Systeme" / Priority Programme of the German Research Foundation "Real-Time Optimization of Large Systems")

Unsupervised learning algorithms are designed to extract structure from data without reference to explicit teacher information. The quality of the learned structure is determined by a cost function which guides the learning process. This paper proposes Empirical Risk Approximation as a new induction principle for unsupervised learning. The complexity of the unsupervised learning models is automatically controlled by two conditions for learning: (i) the empirical risk of learning should uniformly converge towards the expected risk; (ii) the hypothesis class should retain a minimal variety for consistent inference. The maximum entropy principle with deterministic annealing as an efficient search strategy arises from the Empirical Risk Approximation principle as the optimal inference strategy for large learning problems. Parameter selection of learnable data structures is demonstrated for the case of k-means clustering.

1 What is unsupervised learning?

Learning algorithms are designed with the goal in mind that they should extract structure from data. Two classes of algorithms have been widely discussed in the literature: supervised and unsupervised learning. The distinction between the two classes relates to supervision or teacher information which is either available to the learning algorithm or missing in the learning process. This paper presents a theory of unsupervised learning which has been developed in analogy to the highly successful statistical learning theory of classification and regression [Vapnik, 1982, Vapnik, 1995]. In supervised learning of classification boundaries or of regression functions the learning algorithm is provided with example points and selects the best candidate function from a set of functions, called the hypothesis class. Statistical learning theory, developed by Vapnik and Chervonenkis in a series of seminal papers (see [Vapnik, 1982, Vapnik, 1995]), measures the amount of information in a data set which can be used to determine the parameters of the classification or regression models. Computational learning theory [Valiant, 1984] addresses computational problems of supervised learning in addition to the statistical constraints.

In this paper I propose a theoretical framework for unsupervised learning based on optimization of a quality functional for structures in data. The learning algorithm extracts an underlying structure from a sample data set under the guidance of a quality measure denoted as learning costs. The extracted structure of the data is encoded by a loss function and it is assumed to produce a learning risk below a predefined risk threshold. This induction principle is referred to as Empirical Risk Approximation (ERA) and is summarized …
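To make conditions (i) and (ii) concrete for the k-means case mentioned above, the two risks can be written down explicitly. The notation below (prototype set Y = {y_1, ..., y_k}, i.i.d. sample x_1, ..., x_n drawn from a distribution \mu, hypothesis class \mathcal{H}, slack \varepsilon) is a paraphrasing sketch and is not taken verbatim from the paper:

    R(Y) = \int \min_{\nu \le k} \lVert x - y_\nu \rVert^2 \, d\mu(x)
    \qquad \text{(expected risk)},

    \hat{R}(Y) = \frac{1}{n} \sum_{i=1}^{n} \min_{\nu \le k} \lVert x_i - y_\nu \rVert^2
    \qquad \text{(empirical risk)}.

Condition (i) then reads \sup_{Y \in \mathcal{H}} \lvert \hat{R}(Y) - R(Y) \rvert \to 0 as n \to \infty (uniform convergence over the hypothesis class), and ERA accepts every hypothesis Y whose empirical risk stays below a predefined threshold R_{\min} + \varepsilon, rather than selecting only the empirical minimizer as in empirical risk minimization.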

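The abstract's claim that the maximum entropy principle with deterministic annealing is the preferred search strategy can be illustrated with a short sketch: points receive Gibbs (maximum-entropy) assignment probabilities at a temperature T, prototypes are re-estimated as weighted means, and T is lowered until the soft assignments harden into ordinary k-means. The Python code below is a minimal, generic implementation of this idea, not the paper's own algorithm; the function name, cooling schedule, and default parameters are illustrative assumptions.

    import numpy as np

    def deterministic_annealing_kmeans(X, k, T_init=10.0, T_final=1e-3,
                                       cooling=0.9, inner_iter=20, seed=0):
        """Sketch of k-means clustering with deterministic annealing.

        At temperature T each point is assigned to every prototype with a
        Gibbs (maximum-entropy) weight exp(-||x - y||^2 / T); prototypes are
        re-estimated as probability-weighted means, and T is lowered
        geometrically so the soft assignments harden toward plain k-means.
        """
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # start all prototypes near the data mean, slightly perturbed
        Y = X.mean(axis=0) + 1e-3 * rng.standard_normal((k, d))
        T = T_init
        while T > T_final:
            for _ in range(inner_iter):
                # squared distances between every point and every prototype, shape (n, k)
                d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
                # Gibbs assignment probabilities at temperature T
                logits = -d2 / T
                logits -= logits.max(axis=1, keepdims=True)   # numerical stability
                P = np.exp(logits)
                P /= P.sum(axis=1, keepdims=True)
                # re-estimate prototypes as probability-weighted means
                weights = P.sum(axis=0)[:, None] + 1e-12
                Y = (P.T @ X) / weights
            T *= cooling
        return Y

    # usage (illustrative): prototypes = deterministic_annealing_kmeans(X, k=3)

At high temperature all prototypes coincide near the data mean; as T decreases they split in the phase-transition manner described in the clustering literature cited below, which is what makes the annealed search robust against poor local minima.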
[1] E. M., et al. Statistical Mechanics, 2021, Manual for Theoretical Chemistry.
[2] Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities, 1971.
[3] Richard O. Duda, et al. Pattern Classification and Scene Analysis, 1974, A Wiley-Interscience publication.
[4] David Pollard, et al. Quantization and the method of k-means, 1982, IEEE Trans. Inf. Theory.
[5] Leslie G. Valiant, et al. A theory of the learnable, 1984, STOC '84.
[6] R. Gray, et al. Vector quantization, 1984, IEEE ASSP Magazine.
[7] Robin Sibson, et al. What is projection pursuit?, 1987.
[8] L. Devroye, et al. Nonparametric density estimation: the L1 view, 1987.
[9] Teuvo Kohonen, et al. Self-Organization and Associative Memory, 1988.
[10] F. Girosi, et al. Networks for approximation and learning, 1990, Proc. IEEE.
[11] Yoav Freund, et al. Boosting a weak learning algorithm by majority, 1995, COLT '90.
[12] Rose, et al. Statistical mechanics and phase transitions in clustering, 1990, Physical Review Letters.
[13] Geoffrey E. Hinton, et al. Self-organizing neural network that discovers surfaces in random-dot stereograms, 1992, Nature.
[14] Helge J. Ritter, et al. Neural Computation and Self-Organizing Maps: An Introduction, 1992, Computation and Neural Systems Series.
[15] Naftali Tishby, et al. Distributional Clustering of English Words, 1993, ACL.
[16] Joachim M. Buhmann, et al. Vector quantization with complexity costs, 1993, IEEE Trans. Inf. Theory.
[17] Heekuck Oh, et al. Neural Networks for Pattern Recognition, 1993, Adv. Comput.
[18] Robert A. Jacobs, et al. Hierarchical Mixtures of Experts and the EM Algorithm, 1993, Neural Computation.
[19] Tamás Linder, et al. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding, 1994, IEEE Trans. Inf. Theory.
[20] Sompolinsky, et al. Statistical mechanics of the maximum-likelihood density estimation, 1994, Physical Review E: Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics.
[21] D. Haussler, et al. Rigorous learning curve bounds from statistical mechanics, 1994, COLT '94.
[22] G. Lugosi, et al. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding, 1994, Proceedings of the 1994 IEEE International Symposium on Information Theory.
[23] Terrence J. Sejnowski, et al. An Information-Maximization Approach to Blind Separation and Blind Deconvolution, 1995, Neural Computation.
[24] Geoffrey E. Hinton, et al. The Helmholtz Machine, 1995, Neural Computation.
[25] Yoshua Bengio, et al. Pattern Recognition and Neural Networks, 1995.
[26] László Györfi, et al. A Probabilistic Theory of Pattern Recognition, 1996, Stochastic Modelling and Applied Probability.
[27] J. Wellner, et al. Rates of Convergence, 1996.
[28] M. Talagrand. A new look at independence, 1996.
[29] Joachim M. Buhmann, et al. Pairwise Data Clustering by Deterministic Annealing, 1997, IEEE Trans. Pattern Anal. Mach. Intell.
[30] Geoffrey E. Hinton, et al. Generative models for discovering sparse distributed representations, 1997, Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences.
[31] Tamás Linder, et al. Empirical quantizer design in the presence of source noise or channel noise, 1997, IEEE Trans. Inf. Theory.
[32] Joachim M. Buhmann, et al. Multidimensional Scaling by Deterministic Annealing, 1997, EMMCVPR.
[33] Yishay Mansour, et al. An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering, 1997, UAI.
[34] Christopher M. Bishop, et al. GTM: The Generative Topographic Mapping, 1998, Neural Computation.
[35] 유정수, et al. Time series prediction using hierarchical mixtures of experts with annealing, 1998.
[36] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.
[37] G. Barreto, et al. A self-organizing neural network, 2000.
[38] Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.