Aligning Synthetic Medical Images with Clinical Knowledge using Human Feedback

Generative models capable of capturing nuanced clinical features in medical images hold great promise for facilitating clinical data sharing, enhancing rare disease datasets, and efficiently synthesizing annotated medical images at scale. Despite this potential, assessing the quality of synthetic medical images remains a challenge. While modern generative models can synthesize visually realistic medical images, the clinical validity of these images may be called into question. Domain-agnostic scores, such as the Fréchet Inception Distance (FID), precision, and recall, cannot incorporate clinical knowledge and are therefore not suitable for assessing clinical plausibility. Additionally, generative models may fail to synthesize clinically plausible images in numerous unpredictable ways, which makes it difficult to anticipate potential failures and to manually design scores that detect them. To address these challenges, this paper introduces a pathologist-in-the-loop framework for generating clinically plausible synthetic medical images. Starting with a diffusion model pretrained on real images, our framework comprises three steps: (1) having expert pathologists evaluate the generated images to assess whether they satisfy clinical desiderata, (2) training a reward model that predicts the pathologist feedback on new samples, and (3) incorporating expert knowledge into the diffusion model by using the reward model to inform a finetuning objective. We show that human feedback significantly improves the quality of synthetic images in terms of fidelity, diversity, utility in downstream applications, and plausibility as evaluated by experts.
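
The three steps of the framework can be illustrated with a toy sketch. The abstract does not specify the reward model architecture or the exact finetuning objective, so everything below is an assumption made for illustration: images are replaced by two-dimensional feature vectors, pathologist feedback is simulated by a simple rule, the reward model is a logistic regression, and the finetuning signal is a reward-weighted average of per-sample losses. All function and variable names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_reward_model(features, feedback, lr=0.5, steps=2000):
    """Fit a logistic regression that predicts P(plausible | features).

    Assumption: a linear model stands in for the reward model; the paper's
    actual architecture is not given in the abstract.
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(features @ w + b)
        grad = p - feedback  # gradient of the cross-entropy w.r.t. the logits
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def predict_reward(features, w, b):
    """Predicted probability that an expert would mark a sample plausible."""
    return sigmoid(features @ w + b)


def reward_weighted_loss(per_sample_loss, rewards):
    """Average per-sample losses with weights proportional to the rewards.

    Assumption: this is one simple way a reward model could inform a
    finetuning objective, not necessarily the objective used in the paper.
    """
    weights = rewards / rewards.sum()
    return float(np.sum(weights * per_sample_loss))


rng = np.random.default_rng(0)

# Step 1 (simulated): features of generated samples plus binary expert feedback.
features = rng.normal(size=(200, 2))
feedback = (features[:, 0] > 0).astype(float)  # stand-in for pathologist labels

# Step 2: train a reward model that predicts the feedback on new samples.
w, b = train_reward_model(features, feedback)
rewards = predict_reward(features, w, b)
accuracy = float(np.mean((rewards > 0.5) == (feedback == 1.0)))

# Step 3: use predicted rewards to weight a placeholder per-sample training loss.
per_sample_loss = rng.uniform(0.5, 1.5, size=200)
loss = reward_weighted_loss(per_sample_loss, rewards)
```

In this sketch, samples that the reward model scores as plausible contribute more to the finetuning loss, so the generator is pushed toward regions that the simulated experts accepted. A real system would replace the feature vectors with images, the logistic regression with a learned image classifier, and the placeholder losses with the diffusion model's denoising losses.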
