Data filtering and distribution modeling algorithms for machine learning

This thesis is concerned with the analysis of algorithms for machine learning. The main focus is on the role of the distribution of the examples used for learning. Chapters 2 and 3 are concerned with algorithms for learning concepts from random examples. Briefly, the goal of the learner is to observe a set of labeled instances and generate a hypothesis that approximates the rule mapping the instances to their labels. Chapter 2 describes and analyzes an algorithm that improves the performance of a general concept learning algorithm by selecting the labeled instances that are most informative; this work improves on previous work by Schapire. The analysis provides upper bounds on the time, space, and number of examples required for concept learning. Chapter 3 is concerned with situations in which the learner can select, out of a stream of random instances, those for which it wants to know the label. We analyze an algorithm of Seung et al. for selecting such instances and prove that it is effective for the perceptron concept class. Both Chapters 2 and 3 exhibit situations in which a carefully selected, exponentially small fraction of the random training examples suffices for learning. Chapter 4 is concerned with learning distributions of binary vectors. Here we present a new distribution model that can represent combinations of correlation patterns. We describe two different algorithms for learning this distribution model from random examples and provide experimental evidence that they are effective. We conclude, in Chapter 5, with a brief discussion of the possible use of our algorithms in real-world problems and compare them with classical approaches from pattern recognition.
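
To make the selective-sampling idea of Chapter 3 concrete, here is a minimal sketch of query-by-committee filtering for the perceptron concept class. It is an illustration rather than the algorithm analyzed in the thesis: the committee members are obtained by retraining a perceptron from random initializations instead of by Gibbs sampling from the version space, and the dimension d, the synthetic data stream, and the helpers label and train_perceptron are invented here for the example.

```python
# Toy query-by-committee selective sampling for a perceptron (half-space) target.
# All names and parameters are illustrative, not taken from the thesis.
import numpy as np

rng = np.random.default_rng(0)
d = 5                                      # input dimension (arbitrary choice)
w_target = rng.normal(size=d)
w_target /= np.linalg.norm(w_target)       # the unknown target half-space


def label(w, x):
    """Return the +1/-1 label of instance x under weight vector w."""
    return 1 if np.dot(w, x) >= 0 else -1


def train_perceptron(X, y, epochs=20):
    """Return a hypothesis (roughly) consistent with the labeled examples.

    Random initialization and random example order give different consistent
    hypotheses on different calls -- a crude stand-in for sampling from the
    version space, which is what the query-by-committee analysis assumes.
    """
    w = rng.normal(size=d)
    if not X:
        return w / np.linalg.norm(w)
    idx = np.arange(len(X))
    for _ in range(epochs):
        rng.shuffle(idx)
        mistakes = 0
        for i in idx:
            if label(w, X[i]) != y[i]:
                w = w + y[i] * X[i]        # standard perceptron update
                mistakes += 1
        if mistakes == 0:                  # consistent with all labels seen so far
            break
    return w / np.linalg.norm(w)


X_labeled, y_labeled = [], []
stream_length, queries = 300, 0

for _ in range(stream_length):
    x = rng.normal(size=d)                 # random unlabeled instance from the stream
    w1 = train_perceptron(X_labeled, y_labeled)
    w2 = train_perceptron(X_labeled, y_labeled)
    if label(w1, x) != label(w2, x):       # committee disagrees: this instance is informative
        X_labeled.append(x)
        y_labeled.append(label(w_target, x))   # query the true label
        queries += 1

w_final = train_perceptron(X_labeled, y_labeled)
test = rng.normal(size=(2000, d))
error = np.mean([label(w_final, t) != label(w_target, t) for t in test])
print(f"queried {queries} labels out of {stream_length} instances; test error ~ {error:.3f}")
```

On a typical run only a modest fraction of the streamed instances triggers a label query while the final hypothesis remains accurate, which is the qualitative phenomenon that Chapters 2 and 3 analyze rigorously.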

[2]  D. Lindley On a Measure of the Information Provided by an Experiment , 1956 .

[3]  David R. Cox The analysis of binary data , 1970 .

[4]  W. J. Studden,et al.  Theory Of Optimal Experiments , 1972 .

[5]  Norbert Sauer,et al.  On the Density of Families of Sets , 1972, J. Comb. Theory, Ser. A.

[6]  Richard O. Duda and Peter E. Hart  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[7]  John W. Tukey,et al.  A Projection Pursuit Algorithm for Exploratory Data Analysis , 1974, IEEE Transactions on Computers.

[9]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[10]  Tom Michael Mitchell Version spaces: an approach to concept learning. , 1979 .

[11]  Temple F. Smith Occam's razor , 1980, Nature.

[12]  R. J. Serfling  Approximation Theorems of Mathematical Statistics , 1980 .

[13]  A. Cohen,et al.  Finite Mixture Distributions , 1982 .

[14]  J J Hopfield,et al.  Neural networks and physical systems with emergent collective computational abilities. , 1982, Proceedings of the National Academy of Sciences of the United States of America.

[15]  Carl H. Smith,et al.  Inductive Inference: Theory and Methods , 1983, CSUR.

[16]  J. Friedman,et al.  PROJECTION PURSUIT DENSITY ESTIMATION , 1984 .

[17]  Leslie G. Valiant,et al.  A theory of the learnable , 1984, STOC '84.

[18]  D. Freedman,et al.  Asymptotics of Graphical Projection Pursuit , 1984 .

[19]  Geoffrey E. Hinton,et al.  A Learning Algorithm for Boltzmann Machines , 1985, Cogn. Sci..

[20]  Peter Smith Convexity methods in variational calculus , 1985 .

[21]  David Haussler,et al.  Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension , 1986, STOC '86.

[22]  Carsten Peterson,et al.  A Mean Field Theory Learning Algorithm for Neural Networks , 1987, Complex Syst..

[23]  Stuart Geman,et al.  Stochastic Relaxation Methods for Image Restoration and Expert Systems , 1988 .

[24]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems , 1988 .

[25]  Stuart Geman,et al.  Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images , 1988 .

[26]  David Haussler,et al.  Predicting {0,1}-functions on randomly drawn points , 1988, COLT '88.

[27]  David Haussler,et al.  Equivalence of models for polynomial learnability , 1988, COLT '88.

[28]  Erkki Oja,et al.  Neural Networks, Principal Components, and Subspaces , 1989, Int. J. Neural Syst..

[29]  Steven J. Nowlan,et al.  Maximum Likelihood Competitive Learning , 1989, NIPS.

[30]  Terence D. Sanger,et al.  Optimal unsupervised learning in a single-layer linear feedforward neural network , 1989, Neural Networks.

[31]  Ronald L. Graham,et al.  Concrete mathematics - a foundation for computer science , 1991 .

[32]  David A. Cohn,et al.  Training Connectionist Networks with Queries and Selective Sampling , 1989, NIPS.

[33]  David Haussler,et al.  Learnability and the Vapnik-Chervonenkis dimension , 1989, JACM.

[34]  Yoav Freund,et al.  Boosting a weak learning algorithm by majority , 1995, COLT '90.

[35]  Eric B. Baum,et al.  Constructing Hidden Units Using Examples and Queries , 1990, NIPS.

[36]  Ronald L. Rivest,et al.  On the sample complexity of pac-learning using random and chosen examples , 1990, Annual Conference Computational Learning Theory.

[37]  Tomaso A. Poggio,et al.  Extensions of a Theory of Networks for Approximation and Learning , 1990, NIPS.

[38]  Robert E. Schapire,et al.  Efficient distribution-free learning of probabilistic concepts , 1990, Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science.

[39]  Wolfgang Kinzel,et al.  Improving a Network Generalization Ability by Selecting Examples , 1990 .

[40]  David Haussler,et al.  Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension , 1991, COLT '91.

[41]  Anders Krogh,et al.  Introduction to the theory of neural computation , 1994, The advanced book program.

[42]  Eric B. Baum,et al.  Neural net algorithms that learn in polynomial time from examples and queries , 1991, IEEE Trans. Neural Networks.

[43]  Balas K. Natarajan,et al.  Machine Learning: A Theoretical Approach , 1992 .

[44]  Alexander A. Razborov,et al.  Majority gates vs. general weighted threshold gates , 1992, [1992] Proceedings of the Seventh Annual Structure in Complexity Theory Conference.

[45]  Harris Drucker,et al.  Improving Performance in Neural Networks Using a Boosting Algorithm , 1992, NIPS.

[46]  Yoav Freund,et al.  An improved boosting algorithm and its implications on learning complexity , 1992, COLT '92.

[47]  H. Sebastian Seung,et al.  Query by committee , 1992, COLT '92.

[48]  Michael Kearns,et al.  Efficient noise-tolerant learning from statistical queries , 1993, STOC.

[49]  György Turán,et al.  Lower bounds for PAC learning with queries , 1993, COLT '93.

[50]  Michael Kharitonov,et al.  Cryptographic hardness of distribution-specific learning , 1993, STOC.

[51]  Leslie G. Valiant,et al.  Cryptographic Limitations on Learning Boolean Formulae and Finite Automata , 1993, Machine Learning: From Theory to Applications.

[52]  W. Näther Optimum experimental designs , 1994 .

[53]  Manfred K. Warmuth,et al.  The Weighted Majority Algorithm , 1994, Inf. Comput..

[54]  G. Kane Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol 1: Foundations, vol 2: Psychological and Biological Models , 1994 .

[55]  S. Klinke,et al.  Exploratory Projection Pursuit , 1995 .