Assessing the Potential of USMLE-Like Exam Questions Generated by GPT-4

Prior work has shown that large language models such as GPT-4 and Med-PaLM 2 can answer sample questions from the USMLE Step 2 Clinical Knowledge (CK) exam with greater than 80% accuracy. But can these models also create USMLE-like exam questions? Such a capability could assist humans both in writing such exams and in preparing for them. Here we assess the ability of GPT-4 to generate realistic exam questions by asking licensed physicians to (1) distinguish AI-generated questions from genuine USMLE Step 2 CK questions and (2) assess the validity of AI-generated questions and answers. We find that GPT-4 can generate question/answer pairs that are largely indistinguishable from human-written ones, and that a majority (64%) of the AI-generated questions are deemed "valid" by a panel of licensed physicians.