Artificial intelligence chatbots failed to accurately generate a list of possible diagnoses based on initial patient symptoms more than 80% of the time, but improved considerably when given more information, according to a recent study conducted by researchers at Somerville, Mass.-based Mass General Brigham.
The findings from “Large Language Model Performance and Clinical Reasoning Tasks” were published April 13 in JAMA Network Open.
Five things to know:
1. The researchers set out to determine whether large language models can perform reliably across clinical workflows. They tested 21 AI models on 29 standardized medical case scenarios drawn from the MSD Manual, a peer-reviewed clinical reference used to train medical professionals, yielding 16,254 responses in total. Medical students scored each model’s responses against established answer keys. Analyses were conducted between January and December 2025. Real-time web search and other add-on features were disabled.
2. The AI models were walked through the steps of a real patient encounter, including differential diagnosis — generating a list of possible diagnoses based on symptoms — followed by ordering diagnostic tests, making a final diagnosis and planning treatment.
3. Differential diagnosis was the weakest area across all 21 models tested, with failure rates exceeding 80% and reaching 100% for some models in certain scenarios. The researchers noted this weakness was consistent with a prior study by some of the same authors, suggesting newer AI versions have not resolved the problem.
4. Failure rates for final diagnosis were less than 40% across all models. When the models were given more information and prompted for a final diagnosis, failure rates fell to as low as 9% for the best performers.
5. The authors said current models lack the reasoning processes needed for safe clinical use and concluded that the most responsible application today is targeted, clinician-supervised use in low-uncertainty tasks.
The post AI chatbots miss initial diagnoses 80% of the time: Mass General Brigham study appeared first on Becker's Hospital Review | Healthcare News & Analysis.
