
Off-the-Shelf LLMs Unprepared for Clinical Use


Chatbots Are Getting Better at Making Final Diagnoses, but Clinical Reasoning Remains Weak


A recent study from Mass General Brigham offers significant insight into the capabilities of general-purpose large language model (LLM) chatbots in healthcare. The research assessed how well these AI models perform at making final diagnoses. While the findings show that chatbots have improved at identifying final diagnoses, they also reveal persistent weaknesses in clinical reasoning: the models struggle to rule out alternative conditions and potential causes of symptoms, prompting health researchers to voice concern.

In the study, researchers tasked 21 general-purpose LLMs with “playing doctor” in a variety of clinical scenarios. The models tested included some of the latest iterations available, such as GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash and Pro, and Grok 4. Three medical students assessed and scored the chatbots on their performance across the sequential stages of a standard clinical workflow, focusing on five domains: differential diagnosis, diagnostic testing, final diagnosis, management, and miscellaneous clinical reasoning questions. The study ran for a year, from January 2025 to December 2025.

The methodology involved feeding each model a series of clinical vignettes based on 29 published clinical cases. To mimic how clinical cases naturally evolve, the researchers revealed information about each case gradually, starting with basic details such as patient age, gender, and presenting symptoms, then adding physical examination findings and laboratory results as the scenario unfolded.
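The study's actual prompts and scoring code are not reproduced in the article, but a minimal Python sketch can illustrate the staged-disclosure setup it describes. Everything here is an assumption for illustration: query_model is a hypothetical stand-in for whichever chat API each model exposes, and the sample case is invented, not one of the 29 study vignettes.

# Minimal sketch of a staged-disclosure evaluation loop (illustrative only).
# `query_model` is a hypothetical placeholder for a real chat-completion
# API; the sample case below is invented and not taken from the study.

from dataclasses import dataclass, field

@dataclass
class Stage:
    info: str      # new case information revealed at this stage
    question: str  # clinical question posed once that information is shown

@dataclass
class Vignette:
    case_id: str
    stages: list[Stage] = field(default_factory=list)

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion call."""
    raise NotImplementedError("wire up the actual model API here")

def run_vignette(model_name: str, vignette: Vignette) -> list[dict]:
    """Reveal the case stage by stage, carrying earlier context forward,
    and record every answer so human graders can score it later."""
    context = ""
    transcript = []
    for stage in vignette.stages:
        context += stage.info + "\n"  # case information accumulates
        prompt = f"{context}\nQuestion: {stage.question}"
        answer = query_model(model_name, prompt)
        transcript.append({"question": stage.question, "answer": answer})
    return transcript

# Stages mirror the article's description: demographics and presenting
# symptoms first, then exam findings, then laboratory results.
case = Vignette("case-01", [
    Stage("58-year-old man presenting with acute chest pain.",
          "What is your differential diagnosis?"),
    Stage("Exam: diaphoretic, blood pressure 92/60.",
          "Which diagnostic tests would you order?"),
    Stage("Labs: markedly elevated troponin.",
          "What is your final diagnosis?"),
])

Accumulating the context string rather than resetting it at each stage is what forces a model to reason with only the information revealed so far, which is the sparse, open-ended early phase where the study found the models weakest.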

Notably, the study highlighted significant limitations in key clinical areas, particularly differential diagnosis and diagnostic testing. Differential diagnosis, a core physician skill, is the process of weighing and ruling out the various conditions a patient's symptoms could plausibly indicate. Alarmingly, the models failed to produce an appropriate differential diagnosis 80% of the time. Their performance at reaching a final diagnosis, meaning the determination of the condition once test results have excluded the alternatives, was comparatively strong, with a correct final diagnosis 90% of the time.

Arya Rao, the lead author of the study and an MD-PhD student at Harvard Medical School, explained the rationale behind this research. “The intention was to place these models in the role of a doctor,” said Rao. “What was evident is that these models excel in identifying a final diagnosis once presented with a complete set of data. However, they encounter difficulties during the open-ended beginning stages of a case when the information available is sparse. Unlike clinicians, who do not prematurely close off possibilities until the iterative process of differential diagnosis leads to a conclusion, these models tend to fixate on single answers too quickly—a limitation that persists across various generations of models.”

The study concluded with a cautionary note about using these models in clinical settings, warning that such systems are not yet equipped for frontline decision-making. The researchers emphasized that while strong performance on final diagnosis tasks might suggest the models are ready for patient-facing clinical applications, their consistent failure to generate sound differential diagnoses is a critical flaw that makes them untrustworthy for immediate clinical use.

In light of these findings, the researchers advise healthcare practitioners to restrict AI models to supervised tasks involving minimal uncertainty. That guidance aligns with a separate report released earlier this year by the patient safety research organization ECRI Institute, which named AI chatbots the leading health technology hazard for 2026. Together, the findings prompt a reevaluation of how AI technologies are deployed in clinical settings, underscoring the need for robust oversight and cautious integration into patient care.

