A great number of services, experiments, and decisions at Yahoo! require analyzing rich data sources. This data almost invariably holds a large number of attributes. In these scenarios, the efficient selection of relevant attributes is imperative for data analysis (e.g., modeling, prediction). When approaching new data analysis tasks, domain experts, researchers, and engineers spend a considerable amount of resources identifying (manually or semi-automatically) these relevant attributes. This paper attempts to address this problem by providing a simple and largely automated attribute selection approach. The method is based on reformulating the mutual information (MI) measure. We show why MI cannot in general be used effectively without considerable domain expertise, and we describe a more appropriate measure that allows for a much larger level of automation (removing considerable manual work from the analysis loop). Experiments on the tasks of predicting clicks and conversions for the Yahoo! display advertising platform, in the context of the NGDStone project, show the effectiveness of the proposed approach.
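To make the baseline concrete, the following is a minimal sketch of plain MI-based attribute ranking, the starting point the abstract says must be reformulated. The measure, variable names, and toy data are illustrative assumptions, not the paper's method: MI between each candidate attribute and a binary click label is estimated from empirical counts, and attributes are ranked by that score.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired discrete samples
    using empirical (plug-in) probabilities."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # log2( p(x,y) / (p(x) * p(y)) ), with counts substituted in
        mi += p_joint * log2(c * n / (px[x] * py[y]))
    return mi

# Toy example (hypothetical data): rank two attributes by MI with clicks.
clicks = [1, 1, 0, 0, 1, 0, 0, 0]
attr_a = [1, 1, 0, 0, 1, 0, 0, 0]  # identical to the label: maximally informative
attr_b = [1, 0, 1, 0, 1, 0, 1, 0]  # nearly independent of the label

scores = {name: mutual_information(vals, clicks)
          for name, vals in [("attr_a", attr_a), ("attr_b", attr_b)]}
ranking = sorted(scores, key=scores.get, reverse=True)
```

On this toy data, `attr_a` ranks above `attr_b`, as expected. Note that the plug-in estimate above is exactly the kind of raw MI score whose pitfalls (e.g., bias toward high-cardinality attributes on finite samples) motivate the paper's reformulated measure.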