Learning from Infinite Data in Finite Time

We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model Mn→ learned by the algorithm using ni examples in step i (n→ = (n1, ..., nm)), and the model M∞ that would be learned using infinite examples. Upper-bound the loss L(Mn→, M∞) between them as a function of n→, and then minimize the algorithm's time complexity f(n→) subject to the constraint that L(Mn→, M∞) exceed ε with probability at most δ. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach.
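To make the recipe concrete, below is a minimal sketch, not the paper's actual bounds or algorithm: it assumes a simple Hoeffding-style bound on each step's sufficient statistics, splits δ uniformly across a fixed budget of EM steps (a union bound, i.e. choosing n→ with equal components rather than optimizing the full vector), and uses unit-variance spherical Gaussians. The names step_sample_size and subsampled_em are hypothetical.

```python
import numpy as np

def step_sample_size(eps_i, delta_i, value_range=1.0):
    """Hoeffding-style bound: averaging n >= R^2 ln(2/delta_i) / (2 eps_i^2)
    bounded observations keeps the average within eps_i of its expectation
    with probability >= 1 - delta_i."""
    return int(np.ceil(value_range ** 2 * np.log(2.0 / delta_i)
                       / (2.0 * eps_i ** 2)))

def subsampled_em(X, k, eps, delta, max_steps=50, seed=0):
    """EM for a mixture of unit-variance spherical Gaussians, with each step's
    sufficient statistics estimated from a subsample just large enough to meet
    a per-step bound (delta split uniformly over the step budget)."""
    rng = np.random.default_rng(seed)
    n_total, d = X.shape
    mu = X[rng.choice(n_total, size=k, replace=False)]   # initial means
    w = np.full(k, 1.0 / k)                              # mixing weights
    n_i = min(n_total, step_sample_size(eps, delta / max_steps))
    for _ in range(max_steps):
        S = X[rng.choice(n_total, size=n_i, replace=False)]
        # E-step: responsibilities under unit-variance Gaussians
        logp = -0.5 * ((S[:, None, :] - mu[None]) ** 2).sum(-1) + np.log(w)
        logp -= logp.max(axis=1, keepdims=True)          # for stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: means and weights from the subsample's statistics
        new_mu = (r.T @ S) / (r.sum(axis=0)[:, None] + 1e-12)
        w = r.mean(axis=0)
        if np.max(np.abs(new_mu - mu)) < eps:            # converged within eps
            return new_mu, w
        mu = new_mu
    return mu, w
```

The key design point the sketch illustrates is that the sample size per step is driven by (ε, δ) rather than by the size of the data set, so the total running time is bounded no matter how many examples are available.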