- Researchers design a more realistic test to evaluate AI’s clinical communication skills.
- Refining evaluation frameworks like CRAFT-MD can forge a path toward integrating AI tools into clinical practice in an effective and ethical manner.
The advent of artificial intelligence (AI) tools, particularly large language models such as ChatGPT, has generated considerable optimism regarding their potential to alleviate clinician workload in healthcare.
These models are envisioned to streamline processes by triaging patients, collecting medical histories, and even providing preliminary diagnoses.
However, despite their impressive performance on standardised medical assessments, recent research conducted by Harvard Medical School and Stanford University reveals significant shortcomings when these models are subjected to scenarios that more accurately reflect real-world clinical interactions.
The study introduced an evaluation framework known as the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD).
The framework was designed to assess the performance of four large language models in simulated clinical conversations, thereby providing a more realistic measure of their capabilities.
While these models excelled in answering questions akin to those found on medical board exams, their performance deteriorated markedly when engaged in dynamic, conversational exchanges that mirror actual patient-clinician interactions.
CRAFT-MD framework
The findings highlight a critical gap in the evaluation of AI models for clinical use. The traditional method of assessing AI performance through multiple-choice questions fails to capture the complexities inherent in real-world medical dialogues.
As noted by study co-first author Shreya Johri, this approach presupposes that all relevant information is presented in a clear and concise manner, a scenario rarely encountered in actual clinical practice.
“The CRAFT-MD framework aims to rectify this by evaluating the models’ abilities to gather information about symptoms, medications, and family history in a conversational context, thereby enhancing the realism of the assessment.”
The study’s senior author, Pranav Rajpurkar, emphasises the paradox that while AI models may perform admirably on standardised tests, they struggle with the nuanced, back-and-forth interactions typical of a doctor’s visit.
“The dynamic nature of medical conversations—necessitating timely inquiries, the synthesis of disparate information, and reasoning through symptoms—presents challenges that extend beyond mere factual recall or multiple-choice responses.”
The researchers used CRAFT-MD to test four AI models, both proprietary (commercial) and open-source, on 2,000 clinical vignettes featuring conditions common in primary care and spanning 12 medical specialties.
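To make the contrast with multiple-choice testing concrete, here is a minimal Python sketch of a conversational evaluation loop. It is not the CRAFT-MD implementation. The `Vignette` structure, the keyword-matching `patient_reply` simulator, and the scripted `toy_clinician` are hypothetical stand-ins, and the case itself is invented for illustration. The sketch only shows the core idea: the model under test sees a finding only if it asks about it.

```python
from dataclasses import dataclass


@dataclass
class Vignette:
    """Hypothetical case: an opening complaint plus findings revealed only on request."""
    opening: str
    findings: dict  # topic keyword -> patient's answer
    diagnosis: str


def patient_reply(vignette, question):
    """Toy patient simulator: answers only about topics the question mentions."""
    q = question.lower()
    answers = [text for topic, text in vignette.findings.items() if topic in q]
    return " ".join(answers) if answers else "I'm not sure."


def run_conversation(vignette, clinician, max_turns=5):
    """Alternate clinician messages and patient replies until a diagnosis is given."""
    transcript = [("patient", vignette.opening)]
    for _ in range(max_turns):
        message = clinician(transcript)
        transcript.append(("clinician", message))
        if message.startswith("DIAGNOSIS:"):
            return message[len("DIAGNOSIS:"):].strip(), transcript
        transcript.append(("patient", patient_reply(vignette, message)))
    return None, transcript  # no diagnosis within the turn limit


def toy_clinician(transcript):
    """Scripted stand-in for the model under test: asks two questions, then diagnoses."""
    questions = [
        "Do you take any medication?",
        "Any family history of similar problems?",
    ]
    asked = sum(1 for role, _ in transcript if role == "clinician")
    if asked < len(questions):
        return questions[asked]
    heard = " ".join(text for role, text in transcript if role == "patient").lower()
    return "DIAGNOSIS: drug rash" if "antibiotic" in heard else "DIAGNOSIS: unknown"


# Invented example case, for illustration only.
case = Vignette(
    opening="I have an itchy rash on my arms.",
    findings={
        "medication": "I started a new antibiotic last week.",
        "family": "Nobody in my family has skin problems.",
    },
    diagnosis="drug rash",
)

answer, transcript = run_conversation(case, toy_clinician)
correct = answer == case.diagnosis
```

A clinician that skips the medication question never hears about the antibiotic and cannot reach the right answer. That is the kind of failure a multiple-choice question, which presents every detail up front, cannot expose.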
The tests revealed significant limitations, particularly in conducting nuanced medical conversations: the models struggled to take coherent medical histories and to formulate accurate diagnoses.
Targeted recommendations
A closer examination indicates that AI systems frequently falter in eliciting relevant patient information through appropriate questioning, inadvertently neglecting vital clues that could inform clinical decisions. Furthermore, these models demonstrate a decline in performance when tasked with analysing open-ended information as opposed to structured multiple-choice inputs.
The propensity for error is exacerbated in dynamic, back-and-forth dialogues, which better resemble the complexity of actual patient interactions.
To enhance the utility of AI in real-world medical scenarios, a set of targeted recommendations is proposed for both developers and regulators of AI technologies.
First and foremost, training should prioritise the use of conversational, open-ended questions that accurately reflect the organic exchanges observed in doctor-patient interactions. Furthermore, there is an imperative need to rigorously assess the ability of these models to ask pertinent questions and efficiently extract critical information from patients.
In addition, AI systems should be designed to juggle multiple conversational threads, synthesising and integrating disparate information from various interactions.
Enhancing AI models
Importantly, the capacity to unify both textual notes and non-textual data—such as imagery or electrocardiograms (EKGs)—will significantly enhance diagnostic accuracy. Incorporating capabilities to interpret non-verbal cues, including facial expressions, tone, and body language, would further elevate the effectiveness of these AI agents.
Moreover, evaluation should involve both AI evaluators and human experts, since relying solely on human evaluators is resource-intensive and costly.
For instance, CRAFT-MD processed 10,000 conversations within 48 to 72 hours, a task that would otherwise take human evaluators hundreds of hours of simulations and evaluations. Using AI for preliminary evaluations not only conserves human effort but also mitigates the risks of exposing real patients to unverified AI tools.
Ultimately, continued refinement of evaluation frameworks like CRAFT-MD can forge a path toward integrating AI tools into clinical practice in an effective and ethical manner.
As noted by Roxana Daneshjou, a biomedical data science researcher, adopting frameworks that closely mirror genuine healthcare interactions will advance both the testing and the performance of AI models in healthcare. In doing so, the potential for AI to augment clinical practice may be realised, paving the way for improved patient outcomes across medical disciplines.