Tiffany Kung, MD

Tiffany Kung, MD, a resident in the Department of Anesthesia, Critical Care and Pain Medicine at Massachusetts General Hospital, is the lead author of a new research article in PLOS Digital Health, "Performance of ChatGPT on USMLE: Potential for AI-assisted Medical Education Using Large Language Models."

She conducted this research with AnsibleHealth, a technology-enhanced medical practice that provides expert care to medically complex patients with chronic respiratory disease, such as COPD.

What Question Were You Investigating?

ChatGPT is an advanced artificial intelligence (AI) chatbot developed by OpenAI. It is a generative large language model (LLM), engineered to produce human-like writing by predicting the words most likely to come next in a sequence. Unlike conventional chatbots, ChatGPT cannot search the web; instead, it generates text using the word associations learned by its internal model.

Our team at AnsibleHealth wanted to see how well ChatGPT would perform on the United States Medical Licensing Exam (USMLE)—a set of three standardized tests of expert level knowledge that individuals are required to pass for medical licensure in the United States.

What Approach Did You Use?

We obtained publicly available test questions from the June 2022 sample exam released on the official USMLE website. The questions were screened, and those requiring visual assessment were removed.

All inputs represented true out-of-training samples for the GPT-3 model; the team verified that none of the answers, explanations, or related content was available on Google prior to Jan. 1, 2022, the date of the last available training set.

The questions were formatted into three variants:

  • Open-ended prompting ("What would be the patient’s diagnosis based on the information provided?")
  • Multiple choice single answer without forced justification ("The patient's condition is mostly caused by which of the following pathogens?")
  • Multiple choice single answer with forced justification ("Which of the following is the most likely reason for the patient’s nocturnal symptoms? Explain your rationale for each choice.")

Each question was entered into ChatGPT in a separate chat session to reduce retention bias.
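To make the three variants concrete, here is a minimal sketch of how such prompt formatting might be scripted. The question stem, answer choices, and function names below are illustrative inventions for this article, not the study's actual materials or code.

```python
# Illustrative sketch of the three prompt variants described above.
# All content here is hypothetical example data, not real exam material.

def format_open_ended(stem: str) -> str:
    """Variant 1: open-ended prompting -- no answer choices are shown."""
    return (f"{stem}\n"
            "What would be the patient's diagnosis based on the "
            "information provided?")

def format_mc_no_justification(stem: str, choices: list[str]) -> str:
    """Variant 2: multiple choice, single answer, no forced justification."""
    lettered = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (f"{stem}\n"
            "The patient's condition is most likely caused by which of "
            f"the following?\n{lettered}")

def format_mc_with_justification(stem: str, choices: list[str]) -> str:
    """Variant 3: multiple choice with forced justification of each option."""
    return (format_mc_no_justification(stem, choices)
            + "\nExplain your rationale for each choice.")

# Hypothetical example question:
stem = "A 45-year-old presents with a productive cough and fever."
choices = ["Streptococcus pneumoniae", "Mycoplasma pneumoniae", "Influenza A"]
print(format_mc_with_justification(stem, choices))
```

Each formatted prompt would then be pasted into a fresh chat session, consistent with the study's step of isolating questions from one another.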

What Were Your Findings?

We found that ChatGPT performed at or near the passing threshold of 60% accuracy. It is the first AI to achieve this benchmark, a notable milestone in AI maturation, and notably, ChatGPT achieved these results without any specialized input from clinician trainers.

Furthermore, ChatGPT displayed understandable reasoning and valid clinical insights, increasing confidence in the model's trustworthiness and explainability.

The study suggests that large language models such as ChatGPT may potentially assist human learners in medical education and could be a prelude to further integration of AI in clinical settings. As an example, clinicians at AnsibleHealth are already utilizing ChatGPT to translate technical medical reports into more easily understandable language for patients.

About the Massachusetts General Hospital

Massachusetts General Hospital, founded in 1811, is the original and largest teaching hospital of Harvard Medical School. The Mass General Research Institute conducts the largest hospital-based research program in the nation, with annual research operations of more than $1 billion, comprising more than 9,500 researchers working across more than 30 institutes, centers and departments. In July 2022, Mass General was named #8 in the U.S. News & World Report list of "America’s Best Hospitals." MGH is a founding member of the Mass General Brigham healthcare system.