Researchers at Harvard Medical School found that, for the first time, an open-source artificial intelligence tool can diagnose patients as accurately as leading proprietary models like OpenAI’s GPT-4.
In a study published two weeks ago, the open-source model Llama 3.1 405B was tested against GPT-4 on a set of 70 cases, outperforming GPT-4 on both the accuracy of its first suggested diagnosis and its final diagnosis.
Because the tool is open source, users can download and adapt the model. That means hospitals can feed information about their patients’ symptoms into the model without exposing private data to a wider network.
According to Thomas A. Buckley, a student in Harvard’s AI in Medicine Ph.D. program and the study’s first author, “open-source models unlock new scientific research because they can be deployed in a hospital’s own network.”
One major implication of the study is that researchers “can now use state-of-the-art clinical AI directly with patient data,” wrote Buckley. “Hospitals can use patient data to develop custom models (for example, to align with their own patient population).”
Arjun K. Manrai ’08, a professor at HMS who supervised the study, called it “pretty remarkable” that open-source models could perform on par with GPT-4. But he noted that medical researchers previously struggled to use GPT-4 “in the real world” because of concerns about patient privacy.
Buckley said he was drawn to the topic by a 2023 paper that highlighted GPT’s performance on some of the most difficult cases from the New England Journal of Medicine.
“This paper got a ton of attention and basically showed that this large language model, ChatGPT, could somehow solve these incredibly challenging clinical cases, which kind of shocked people,” said Buckley.
“These cases are notoriously difficult,” he added. “They’re some of the most challenging cases seen at the Mass General Hospital, so they’re scary to physicians, and it’s equally scary when an AI model could do the same thing.”
Buckley said that when Llama 3.1 405B was released last year, its sheer scale of 405 billion parameters (the values a model uses to make predictions) seemed to hold a lot of potential.
“It was kind of the first time where we considered, oh, maybe there’s something really different happening in open-source models,” he said.
Manrai said the research “unlocks and opens up a lot of new studies and trials,” since it allows hospitals to safely use patient data in real time.
“With these open source models, you can bring the model to the data, as opposed to sending your data to the model,” Manrai said.
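In practical terms, “bringing the model to the data” means running inference on hardware the hospital controls. The sketch below is a minimal illustration of that idea, not the researchers’ actual setup: it assumes the open-source Ollama runtime is installed with a Llama 3.1 model pulled locally (the 405B variant used in the study requires far more hardware than a typical workstation), and the model tag, case summary, and prompt are invented for illustration.

```python
# Minimal sketch of "bringing the model to the data": a locally hosted
# Llama 3.1 instance answers a diagnostic prompt, so no patient
# information leaves the hospital network. Assumes the Ollama runtime
# is running and a Llama 3.1 model has been pulled; the case text and
# prompt are illustrative, not taken from the study.
import ollama

case_summary = (
    "54-year-old with two weeks of fever, night sweats, "
    "and a new diastolic murmur."
)

response = ollama.chat(
    model="llama3.1",  # hypothetical local model tag; substitute your own
    messages=[
        {
            "role": "system",
            "content": "You are a clinical reasoning assistant. "
                       "List a ranked differential diagnosis.",
        },
        {"role": "user", "content": case_summary},
    ],
)

# The response stays on-premises; a physician still reviews every suggestion.
print(response["message"]["content"])
```

Because the prompt and response never leave the local machine, no patient details cross an outside network, which is the privacy property Buckley and Manrai describe.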
But Manrai and Buckley said that while the AI diagnostic process is promising, humans are still needed to verify the results.
“Our clinical collaborators have been really important, because they can read what the model generates and assess it qualitatively,” Buckley said. “These results are only trustworthy when you can have them assessed by physicians.”
Manrai said in a March 14 press release that AI tools “could be invaluable copilots for busy clinicians” when “used wisely and incorporated responsibly in current health infrastructure.”
“But it remains crucial that physicians help drive these efforts to make sure AI works for them,” he added.
—Staff writer Kaitlyn Y. Choi can be reached at kaitlyn.choi@thecrimson.com.
—Staff writer Sohum M. Sukhatankar can be reached at sohum.sukhatankar@thecrimson.com. Follow him on X @ssukhatankar06.