What is the distribution of the number of unique original items in a bootstrap sample?

Sampling with replacement occurs in many settings in machine learning, notably in the bagging ensemble technique and the .632+ validation scheme. The number of unique original items in a bootstrap sample can play an important role in the behaviour of prediction models learned on it. Indeed, there are uncontrived examples where duplicate items have no effect. The purpose of this report is to present the distribution of the number of unique original items in a bootstrap sample clearly and concisely, with a view to enabling other machine learning researchers to understand and control this quantity in existing and future resampling techniques. We describe the key characteristics of this distribution along with the generalisation for the case where items come from distinct categories, as in classification. In both cases we discuss the normal limit, and conduct an empirical investigation to derive a heuristic for when a normal approximation is permissible.
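As a rough illustration of the quantity discussed above, the following minimal sketch simulates bootstrap samples of size n drawn from n items and compares the average number of unique items with the standard occupancy result n(1 - (1 - 1/n)^n), which is close to 0.632n for large n. The function names (`unique_count`, `expected_unique`, `simulate_mean`) are hypothetical and chosen only for this example; the sketch is not taken from the report itself and covers only the single-category case.

```python
import random


def unique_count(n, rng):
    """Draw one bootstrap sample of size n from n items (with replacement)
    and return how many distinct original items it contains."""
    return len({rng.randrange(n) for _ in range(n)})


def expected_unique(n):
    """Expected number of unique items: each item is missed by all n draws
    with probability (1 - 1/n)**n, so it appears with the complement."""
    return n * (1.0 - (1.0 - 1.0 / n) ** n)


def simulate_mean(n, trials, seed=0):
    """Average number of unique items over `trials` simulated bootstrap samples."""
    rng = random.Random(seed)
    return sum(unique_count(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n = 100
    print("simulated mean:", simulate_mean(n, trials=2000))
    print("expected value:", expected_unique(n))
```

With a few thousand trials the simulated mean should sit close to the closed-form expectation; a histogram of `unique_count` over many trials gives an informal view of how near the distribution is to a normal shape for a given n.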
