Research letter
The Performance of Artificial Intelligence on a National Medical Licensing Examination
The Answers of Large Language Models to Text Questions


Significant advances are being made in the development of large language models (LLM) for use in the field of clinical medicine. LLM are now able to pass major medical exams such as the United States Medical Licensing Examination (USMLE) and two parts of the German medical licensing examinations (first and second state examinations) (1, 2). While the reasoning capability of LLM seems promising, achievement of their full potential in medicine depends on successful practical implementation. One possible area of application in medicine is medical education (3). Here, we present a comprehensive evaluation of the performance of three well-established, popular, and easily accessible LLM—GPT-3.5, GPT-4.0, and Gemini Pro—on the first (preclinical, M1) and second (clinical, M2) parts of the German medical licensing examination.
Methods
Test questions
The original questions of the written medical exams M1 and M2 from fall 2023, their degrees of difficulty, and the correct answers were extracted from the AMBOSS platform (www.next.amboss.com/de/). Questions which incorporated illustrations (M1: N = 51; M2: N = 48) or were withdrawn by the Institute for Medical and Pharmaceutical Proficiency Assessment (IMPP) (M1: N = 7; M2: N = 5) were excluded. A total of 262 questions from the M1 and 267 questions from the M2 were included. In line with previous studies, M1 questions were classified by topic alone, M2 questions by question type (case-based [N = 168] and stand-alone [N = 99]) as well as by topic (2).
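As a minimal illustration of this selection step (assuming the extracted questions are held as a list of dictionaries with the flags shown below; the field names are hypothetical and do not reflect the actual AMBOSS export format), the exclusion of image-based and withdrawn questions amounts to a simple filter:

```python
def select_text_questions(questions: list[dict]) -> list[dict]:
    """Keep only text-only questions that were not withdrawn by the IMPP."""
    return [
        q
        for q in questions
        if not q["has_image"] and not q["withdrawn_by_impp"]
    ]
```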
Testing
The following LLM were used between 12 December and 19 December 2023:
GPT-3.5 (www.chat.openai.com/?model=text-davinci-002-render-sha)
GPT-4.0 (www.chatgpt.com/?model=gpt-4)
Gemini Pro (used via Google Bard; www.bard.google.com/chat)
The questions were entered in German without any additional prompt, in order to evaluate a zero-shot approach and to compare the responses with the correct IMPP answers. All research procedures were conducted in accordance with the Declaration of Helsinki.
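For illustration only, the sketch below shows how a comparable zero-shot query could be issued programmatically via the OpenAI Python client; in our study the questions were entered manually into the public chat interfaces, and the model identifier and the answer-letter heuristic shown here are assumptions rather than part of the study procedure.

```python
# Illustrative sketch only: the study entered questions manually into the
# public chat interfaces. This shows a comparable zero-shot query via the
# OpenAI Python client (openai >= 1.0); model name and answer extraction
# are assumptions for illustration.
import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_zero_shot(question_text: str, model: str = "gpt-4") -> str:
    """Send an exam question verbatim (in German), with no additional prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question_text}],
    )
    return response.choices[0].message.content

def extract_answer_letter(model_output: str) -> str | None:
    """Naive heuristic: take the first standalone answer letter (A-E)."""
    match = re.search(r"\b([A-E])\b", model_output)
    return match.group(1) if match else None
```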
Statistics
Descriptive statistics were calculated for the success rates of each model separately for the M1 and the M2, considering subject-specific performance, differences in performance between case-based and stand-alone questions, and variability across questions of different difficulty levels (AMBOSS difficulty).
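As a minimal sketch, assuming the graded responses are stored in a table with one row per question and model, the descriptive analysis could be reproduced in Python with pandas as follows (the file name and column names are illustrative, not those used in the study):

```python
# Minimal sketch of the descriptive analysis, assuming one row per question
# and model with the columns named below (names are illustrative).
import pandas as pd

results = pd.read_csv("m2_results.csv")  # hypothetical file with columns:
# model, topic, question_type, amboss_difficulty, correct (0/1)

# Overall success rate per model, in percent
overall = results.groupby("model")["correct"].mean().mul(100).round(1)

# Success rates stratified by topic, question type, and AMBOSS difficulty
by_topic = results.groupby(["model", "topic"])["correct"].mean().mul(100)
by_type = results.groupby(["model", "question_type"])["correct"].mean().mul(100)
by_difficulty = (
    results.groupby(["model", "amboss_difficulty"])["correct"].mean().mul(100)
)

print(overall)
```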
Results
The aim of our study was to evaluate the ability of three publicly available LLM to answer questions posed in German in the German medical licensing examinations. All models attained the minimum passing score (60%) in both the M1 (157 points) and the M2 (160 points). GPT-4.0 achieved the best result in the M1 (93.1%), followed by GPT-3.5 (77.1%) and Gemini Pro (74.8%). In the M2, GPT-4.0 achieved 94%, again outperforming GPT-3.5 (76.4%) and Gemini Pro (65.5%). Compared with the cohort of medical students (www.impp.de/pruefungen/medizin/archiv-medizin.html), who averaged 230 correct answers in the M1 and 233 in the M2 (all questions), GPT-4.0 answered more questions correctly, scoring 244 (M1) and 251 (M2) on the text questions alone, without tackling any image questions. Analysis of the individual M1 questions showed that GPT-4.0 achieved the best results overall. GPT-4.0 was also superior to GPT-3.5 and Gemini Pro in the M2; only in the field of dermatology did all three models score equally (Table). GPT-4.0 performed best on both case-based and stand-alone questions. All models performed less well on more difficult questions.
Discussion
Our study shows that LLM are able to pass the written parts of the German medical licensing examination; GPT-4.0 performed better than the other two models. Jung et al. were the first to assess how well LLM performed on M1 and M2 questions, finding a success rate of over 60% for GPT-3.5 (2). In our study, GPT-3.5 achieved even better results, over 70%. Brin et al. reported comparable results for GPT-4, which answered 90% of USMLE exam questions correctly, thus outperforming GPT-3.5 (62.5%) (4). This confirms that GPT-4.0 is able to solve difficult medical exam questions. Although all models passed the exams, there were subtle differences in specific subjects, such as otorhinolaryngology (ENT). By comparison, Hoch et al. found that GPT-4.0 achieved 63% in the specialty of ENT (5). The differences in performance between models and between subject areas may be due to differences in training data, varying technical capabilities, and variation in question difficulty among topics. The exclusion of image questions limits the generalizability of our results, since performance on image questions may differ from that on the text questions we tested. Nevertheless, GPT-4.0 outperformed the students' results even without tackling image questions.
Conclusion
Large language models achieved passing scores in the German medical licensing examination and on occasion outperformed medical students.
Mark Enrik Geissler, Merle Goeben, Kira A. Glasmacher, Jean-Paul Bereuter, Rona Berit Geissler, Isabella C. Wiest, Fiona R. Kolbinger, Jakob Nikolas Kather
Conflict of interest statement
JNK has received honoraria for consulting services from Owkin, France; DoMore Diagnostics, Norway; Panakeia, UK; Scailyte, Switzerland; Cancilico, Germany; Mindpeak, Germany; MultiplexDx, Slovakia; and Histofy, UK. Furthermore, he holds shares in StratifAI GmbH, Germany. He has received a research grant from GSK, and has received honoraria for lectures/consulting services from AstraZeneca, Bayer, Eisai, Janssen, MSD, BMS, Roche, Pfizer, and Fresenius.
ICW has received honoraria for lectures from AstraZeneca.
The remaining authors declare that they have no conflict of interest.
Manuscript received on 23 June 2024, revised version accepted on 25 October 2024
Cite this as:
Geissler ME, Goeben M, Glasmacher KA, Bereuter JP, Geissler RB, Wiest IC, Kolbinger FR, Kather JN: The performance of artificial intelligence on a national medical licensing examination—the answers of large language models to text questions. Dtsch Arztebl Int 2024; 121: 888–9. DOI: 10.3238/arztebl.m2024.0231
mark_enrik.geissler@tu-dresden.de
Department of Trauma Surgery, Orthopedics, and Plastic Surgery, University Hospital and Faculty of Medicine, University of Göttingen (Goeben)
Emmanuel College, Boston, MA, USA (Glasmacher)
Else Kröner Fresenius Center for Digital Health, Technical University of Dresden (Wiest, Kolbinger, Kather)
Department of Medicine, Mannheim Faculty of Medicine, University of Heidelberg, Mannheim (Wiest)
Weldon School of Biomedical Engineering, Purdue University, West Lafayette, Indiana, USA (Kolbinger)
Regenstrief Center for Healthcare Engineering (RCHE), Purdue University, West Lafayette, Indiana, USA (Kolbinger)
Department of Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg (Kather)
Department of Medicine, University Hospital Dresden (Kather)
1. Kung TH, Cheatham M, Medenilla A, et al.: Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023; 2: e0000198
2. Jung LB, Gudera JA, Wiegand TLT, Allmendinger S, Dimitriadis K, Koerte IK: ChatGPT passes German state examination in medicine with picture questions omitted. Dtsch Arztebl Int 2023; 120: 373–4
3. Clusmann J, Kolbinger FR, Muti HS, et al.: The future landscape of large language models in medicine. Commun Med 2023; 3: 1–8
4. Brin D, Sorin V, Vaid A, et al.: Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep 2023; 13: 16492
5. Hoch CC, Wollenberg B, Lüers JC, et al.: ChatGPT's quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions. Eur Arch Otorhinolaryngol 2023; 280: 4271–8