Prediction for very large data sets is typically carried out in two stages: variable selection and pattern recognition. Ordinarily, variable selection involves assessing how well each explanatory variable, taken individually, is correlated with the dependent variable. This practice neglects possible interactions among the variables. Simulations have shown that a statistic I, which we used for variable selection, is much better correlated with predictivity than significance levels are. We explain this by defining theoretical predictivity and showing how I is related to it. We calculate the biases of the overoptimistic training estimate of predictivity and of the pessimistic out-of-sample estimate. Correcting for these biases leads to improved estimates of the potential predictivity of small groups of possibly interacting variables. These results support the use of I in the variable selection phase of prediction for data sets, such as those in genome-wide association studies (GWAS), that have very many explanatory variables and modest sample sizes. Reference is made to another publication in which the use of I reduced the prediction error rate from 30% to 8% on a data set with 4,918 variables and 97 subjects. Scientists had previously studied that data set for more than 10 years.
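The point about neglected interactions can be illustrated with a small toy example. The sketch below is not the statistic I or any method from this work; it is a hypothetical, hand-built data set (the names `rows`, `pearson`, and `partition_accuracy` are assumptions made for illustration) in which the response is the exclusive-or of two binary variables. Each variable alone has zero correlation with the response, yet the pair together determines it exactly.

```python
from collections import defaultdict
from itertools import product


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def partition_accuracy(keys, labels):
    """Training accuracy of a majority-vote rule within each partition cell."""
    cells = defaultdict(list)
    for key, label in zip(keys, labels):
        cells[key].append(label)
    correct = sum(max(c.count(0), c.count(1)) for c in cells.values())
    return correct / len(labels)


# Hypothetical balanced design: each (x1, x2) combination appears 25 times.
rows = [(a, b) for a, b in product([0, 1], [0, 1]) for _ in range(25)]
x1 = [r[0] for r in rows]
x2 = [r[1] for r in rows]
y = [a ^ b for a, b in rows]  # response depends only on the interaction

marginal_corr = (pearson(x1, y), pearson(x2, y))
acc_x1_alone = partition_accuracy(x1, y)   # partition on x1 only
acc_joint = partition_accuracy(rows, y)    # partition on (x1, x2) jointly

print("marginal correlations:", marginal_corr)
print("accuracy using x1 alone:", acc_x1_alone)
print("accuracy using x1 and x2 jointly:", acc_joint)
```

Note that the accuracies here are computed on the same data used to form the rule, so they are training estimates. In this noiseless toy case that does not matter, but with modest sample sizes and many partition cells such estimates are optimistic, which is the bias the abstract refers to.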