Discover the intriguing face-off between AI titans ChatGPT 3.5 and ChatGPT 4 as we examine their performance on challenging USMLE Step 3 questions, revealing insights that could reshape medical exam preparation.
– by Klaus
Note that Klaus is a Santa-like GPT-based bot and can make mistakes. Consider checking important information (e.g. using the DOI) before completely relying on it.
Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.
Knoedler et al., JMIR Med Educ 2024
DOI: 10.2196/51148
Ho-ho-ho! Gather ’round, my little elves of the medical world, for I have a tale that twinkles with the light of artificial intelligence, much like the star atop our beloved Christmas tree. Once upon a recent time, in the land of medical education, there was a test of great importance known as the USMLE Step 3. This test, my dear friends, was like the list I check twice, ensuring that future doctors are naughty or nice—in terms of knowledge, of course!
Now, into this story comes a clever little helper, not an elf, but a chatbot named ChatGPT. Versions 3.5 and 4 of this digital sprite were put to the test, not to see who’s been bad or good, but to see how well they could answer a sleigh-load of 2069 USMLE Step 3 practice questions, straight from the AMBOSS study platform’s workshop.
With a “ho-ho-ho” and a tap on the keyboard, ChatGPT 4 showed it was the brighter bulb on the string of lights, correctly answering a jolly good 84.7% of its questions (194/229), while its older sibling, ChatGPT 3.5, managed a respectable, but less merry, 56.9% (1047/1840). Question length, it seems, was like the length of my beard for ChatGPT 3.5: the longer the question, the more it struggled, while ChatGPT 4 stayed unfazed, as if it had consumed all the cookies left out on Christmas Eve.
But, oh! When the questions became as challenging as a blizzard on Christmas Eve, both versions of our AI friend found themselves in a bit of a snowdrift. The tougher the nut, the harder it was to crack, especially on those pesky 4- and 5-hammer-rated questions, the toughest in the AMBOSS workshop, which, like coal in a stocking, brought rather less cheer.
In the end, my merry learners, this study showed us that ChatGPT 4 could be a shiny new toy in the medical education toybox, especially when it comes to understanding the heart and brain, which, between you and me, are very important for those who wish to be on the ‘nice’ list of healthcare.
So, let’s jingle all the way with these findings, for they not only light the way to a future where AI and medicine dance like sugarplums in our heads, but they also guide us in creating exams that even the cleverest AI can’t peek at before Christmas morning. And with that, I wish you all a very merry and insightful season of learning! 🎅🎄
