HAPI: A Large-scale Longitudinal Dataset of Commercial ML API Predictions

Commercial ML APIs offered by providers such as Google, Amazon and Microsoft have dramatically simplified ML adoption in many applications. Numerous companies and academics pay to use ML APIs for tasks such as object detection, OCR and sentiment analysis. Different ML APIs tackling the same task can have very heterogeneous performance. Moreover, the ML models underlying the APIs also evolve over time. As ML APIs rapidly become a valuable marketplace and a widespread way to consume machine learning, it is critical to systematically study and compare different APIs and to characterize how individual APIs change over time. However, this topic is currently underexplored due to the lack of data. In this paper, we present HAPI (History of APIs), a longitudinal dataset of 1,761,417 instances of commercial ML API applications (involving APIs from Amazon, Google, IBM, Microsoft and other providers) across diverse tasks including image tagging, speech recognition and text mining from 2020 to 2022. Each instance consists of a query input to an API (e.g., an image or text) along with the API's output prediction/annotation and confidence scores. HAPI is the first large-scale dataset of ML API usage and a unique resource for studying ML-as-a-service (MLaaS). As examples of the types of analyses that HAPI enables, we show that ML APIs' performance changes substantially over time: several APIs' accuracies dropped on specific benchmark datasets. Even when an API's aggregate performance stays steady, its error modes can shift across different subtypes of data between 2020 and 2022. Such changes can substantially impact entire analytics pipelines that use an ML API as a component. We further use HAPI to study commercial APIs' performance disparities across demographic subgroups over time. HAPI can stimulate further research in the growing field of MLaaS.
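
Because each HAPI instance pairs a query input with the API's returned prediction and confidence score, the longitudinal comparison described above reduces to grouping instances by API and year. The sketch below is a minimal, hypothetical illustration of such an analysis; the record layout and field names (e.g., "api", "year", "prediction", "label") are assumptions made for illustration and do not reflect HAPI's actual schema.

```python
from collections import defaultdict

# Hypothetical HAPI-style records: one API query, the API's prediction
# and confidence, and a ground-truth label from the benchmark dataset.
# This layout is an illustrative assumption, not the dataset's real format.
records = [
    {"api": "provider_x/sentiment", "year": 2020,
     "prediction": "positive", "confidence": 0.91, "label": "positive"},
    {"api": "provider_x/sentiment", "year": 2022,
     "prediction": "negative", "confidence": 0.87, "label": "positive"},
]

def accuracy_by_year(records, api):
    """Fraction of correct predictions per year for a single API."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        if r["api"] != api:
            continue
        totals[r["year"]] += 1
        hits[r["year"]] += int(r["prediction"] == r["label"])
    return {year: hits[year] / totals[year] for year in sorted(totals)}

print(accuracy_by_year(records, "provider_x/sentiment"))
# {2020: 1.0, 2022: 0.0}  (a drop like this would flag a model update)
```

The same grouping applied to data subtypes or demographic subgroups, rather than years alone, supports the error-mode and fairness analyses mentioned in the abstract.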
