A Marketplace for Data: An Algorithmic Solution

In this work, we aim to design a data marketplace: a robust, real-time matching mechanism for efficiently buying and selling training data for machine learning tasks. Although monetizing data and pre-trained models is a major focus of industry today, no existing market mechanism prices training data and matches buyers to sellers while also addressing the associated computational and other complexity. The challenge in creating such a market stems from the very nature of data as an asset: (i) it is freely replicable; (ii) its value is inherently combinatorial, because of correlation with the signal in other data; (iii) prediction tasks and the value of accuracy vary widely; (iv) the usefulness of training data is difficult to verify a priori without first applying it to a prediction task. Our main contributions are to: (i) propose a mathematical model for a two-sided data market and formally define the key associated challenges; (ii) construct algorithms that allow such a market to function and analyze how they meet the challenges defined. We highlight two technical contributions: (i) a new notion of "fairness" required for cooperative games with freely replicable goods; (ii) a truthful, zero-regret mechanism for auctioning a class of combinatorial goods, based on Myerson's payment function and the Multiplicative Weights algorithm. These may be of independent interest.
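To make the replicability challenge concrete, the following is a minimal sketch of why a standard Shapley-value split can be gamed when goods are freely replicable. It is an illustration under stated assumptions, not the mechanism proposed in this work: the sellers `A` and `B`, the duplicate `A_copy`, and the toy `value` function (any single copy of the data yields the full prediction gain of 1) are all hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal
    contribution over all orderings of the players."""
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += Fraction(value(with_p)) - Fraction(value(coalition))
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def value(coalition):
    # Hypothetical toy game: the sellers hold identical data, so any
    # non-empty coalition achieves the full prediction gain of 1.
    return 1 if coalition else 0


# Two sellers with identical data split the gain evenly.
before = shapley_values(["A", "B"], value)

# Seller A replicates its data and enters it as a second "seller".
after = shapley_values(["A", "A_copy", "B"], value)

share_before = before["A"]
share_after = after["A"] + after["A_copy"]
print(share_before, share_after)
```

In this toy game, seller A raises its combined payout simply by copying its data, even though the total value created is unchanged. This is the kind of incentive that a fairness notion for freely replicable goods has to rule out.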
