Combining Human Predictions with Model Probabilities via Confusion Matrices and Calibration

An increasingly common use case for machine learning models is augmenting the abilities of human decision makers. For classification tasks where neither the human nor the model is perfectly accurate, a key step in obtaining high performance is combining their individual predictions in a manner that leverages their relative strengths. In this work, we develop a set of algorithms that combine the probabilistic output of a model with the class-level output of a human. We show theoretically that the accuracy of our combination model is driven not only by the individual human and model accuracies, but also by the model's confidence. Empirical results on image classification with CIFAR-10 and a subset of ImageNet demonstrate that such human-model combinations consistently achieve higher accuracy than either the model or the human alone, and that the parameters of the combination method can be estimated effectively with as few as ten labeled data points.
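
The combination scheme the abstract sketches can be made concrete with a short example. The Python/NumPy sketch below is a minimal illustration, not the paper's actual algorithm: it assumes a naive-Bayes-style combination in which the model's probability vector is temperature-calibrated, the human's hard label is mapped to a per-class likelihood through a confusion matrix estimated from a handful of labeled examples, and the two terms are multiplied and renormalized. All function names (estimate_confusion, temperature_calibrate, combine) and the conditional-independence assumption are illustrative.

```python
import numpy as np

def estimate_confusion(true_labels, human_labels, num_classes, smoothing=1.0):
    """Estimate P(human predicts j | true class i) from a small labeled set.

    Laplace smoothing keeps every row strictly positive even with as few
    as ten labeled examples."""
    counts = np.full((num_classes, num_classes), smoothing)
    for t, h in zip(true_labels, human_labels):
        counts[t, h] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def temperature_calibrate(model_probs, temperature):
    """Re-temper the model's probability vector (a stand-in for applying
    temperature scaling to the underlying logits)."""
    logits = np.log(np.asarray(model_probs) + 1e-12) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def combine(model_probs, human_label, confusion, temperature=1.0):
    """Posterior over classes: P(y | model, human) is proportional to
    P_model(y) * P(human_label | y), assuming model and human errors
    are conditionally independent given the true class y."""
    calibrated = temperature_calibrate(model_probs, temperature)
    posterior = calibrated * confusion[:, human_label]
    return posterior / posterior.sum()

# Example: a 3-class problem where the model leans toward class 0 but the
# human says class 1; the confusion matrix decides how much to trust each.
confusion = estimate_confusion(
    true_labels=[0, 0, 1, 1, 2, 2, 0, 1, 2, 1],
    human_labels=[0, 0, 1, 2, 2, 2, 0, 1, 2, 1],
    num_classes=3,
)
print(combine([0.6, 0.3, 0.1], human_label=1, confusion=confusion, temperature=1.5))
```

Note the role of calibration in this sketch: with an overconfident model (temperature below 1 after scaling), the model term dominates the product regardless of the human's input, so the combination only reflects the human's relative strengths when the model's probabilities are first brought in line with its true accuracy.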
