Scalable supervised dimensionality reduction using clustering

The automated targeting of online display ads at scale requires the simultaneous evaluation of a single prospect against many independent models. When deciding which ad to show to a user, one must calculate likelihood-to-convert scores for that user across all potential advertisers in the system. For modern machine-learning-based targeting, as conducted by Media6Degrees (M6D), this can mean scoring against thousands of models in a large, sparse feature space. Dimensionality reduction within this space is useful, as it decreases scoring time and model storage requirements. To meet this need, we develop a novel algorithm for scalable supervised dimensionality reduction across hundreds of simultaneous classification tasks. The algorithm performs hierarchical clustering in the space of model parameters from historical models in order to collapse related features into a single dimension. This allows us to implicitly incorporate feature and label data across all tasks without operating directly in a massive space. We present experimental results showing that for this task our algorithm outperforms other popular dimensionality-reduction algorithms across a wide variety of ad campaigns, as well as production results that showcase its performance in practice.
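The core idea lends itself to a short sketch. Below is a minimal illustration, not the paper's implementation: it assumes historical model coefficients are stacked into a matrix W with one row per feature and one column per past campaign's model, uses average-linkage agglomerative clustering with cosine distance (the paper's actual linkage and distance choices may differ), and collapses a raw feature vector by summing the values of the features assigned to each cluster. All names and parameters here are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Hypothetical stand-in for historical model parameters:
# one row per feature, one column per past campaign's model.
n_features, n_tasks, n_clusters = 1000, 50, 100
W = rng.standard_normal((n_features, n_tasks))

# Hierarchically cluster features whose learned coefficients behave
# similarly across the historical tasks. Cosine distance and average
# linkage are assumptions for this sketch, not the paper's choices.
Z = linkage(W, method="average", metric="cosine")
feature_to_cluster = fcluster(Z, t=n_clusters, criterion="maxclust") - 1

def collapse(x, mapping, k=n_clusters):
    """Project a raw feature vector into the reduced space by
    summing the values of all features mapped to each cluster."""
    z = np.zeros(k)
    np.add.at(z, mapping, x)  # scatter-add feature values into clusters
    return z

x = rng.random(n_features)           # a prospect's raw feature vector
z = collapse(x, feature_to_cluster)  # reduced to n_clusters dimensions
```

Because the feature-to-cluster mapping is computed once from historical models, scoring a prospect against every live model requires only the reduced vector z, which is where the savings in scoring time and model storage come from.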
