Assessing the Potential of USMLE-Like Exam Questions Generated by GPT-4

Prior work has shown that large language models such as GPT-4 and Med-PaLM 2 can answer sample questions from the USMLE Step 2 Clinical Knowledge (CK) exam with greater than 80% accuracy. But can these models also create USMLE-like exam questions? Such a capability could assist humans both in writing such exams and in preparing for them. Here we assess the ability of GPT-4 to generate realistic exam questions by asking licensed physicians to (1) distinguish AI-generated questions from genuine USMLE Step 2 CK questions and (2) assess the validity of AI-generated questions and answers. We find that GPT-4 can generate question/answer pairs that are largely indistinguishable from human-written ones, and that a majority (64%) of the AI-generated questions are deemed "valid" by a panel of licensed physicians.