Researchers from the National Institutes of Health (NIH) have demonstrated an artificial intelligence (AI) system that excels at answering medical quiz questions.
A study by experts from NIH's National Library of Medicine (NLM) and Weill Cornell Medicine found that an AI model solved medical quiz questions, designed to test health professionals' ability to diagnose patients from clinical images and brief text summaries, with high accuracy. However, physician-graders found that the AI often made mistakes when describing the images and explaining its reasoning. These findings shed light on both the promise and the limitations of AI in clinical settings.
"Integration of AI into health care holds great promise as a tool to help medical professionals diagnose patients faster, allowing them to start treatment sooner," said NLM Acting Director, Stephen Sherry, Ph.D. "However, as this study shows, AI is not advanced enough yet to replace human experience, which is crucial for accurate diagnosis."
The AI model and human physicians answered questions from the New England Journal of Medicine (NEJM) Image Challenge, an online quiz featuring real clinical images and brief descriptions of patient symptoms. Participants selected the correct diagnosis from multiple-choice options.
The AI model answered 207 image challenge questions and provided a written rationale for each answer, including a description of the image, the relevant medical knowledge, and step-by-step reasoning. Nine physicians from various medical specialties answered the same questions, first without external resources ("closed-book") and then with access to external materials ("open-book"). Researchers then scored the AI model's and the physicians' answers against the correct diagnoses and asked the physicians to evaluate the AI model's written rationales.
The results showed that both the AI model and the physicians scored highly in selecting the correct diagnosis. Notably, the AI outperformed the physicians in the closed-book setting, while physicians with access to external resources outperformed the AI, particularly on the most challenging questions.
Despite choosing the correct final answer, the AI model frequently made mistakes when describing the medical images and explaining its reasoning. For instance, given a photo of a patient's arm with two lesions, the AI failed to recognize that both lesions were caused by the same condition, because they appeared at different angles and so differed in apparent size and shape. A human physician would readily identify the connection between the lesions.
These findings highlight the need for further evaluation of multimodal AI technology before its clinical implementation. "This technology can potentially enhance clinicians' abilities with data-driven insights that could improve clinical decision-making," said Zhiyong Lu, Ph.D., NLM senior investigator and corresponding author of the study. "Understanding the risks and limitations of this technology is essential to harnessing its potential in medicine."
The study used GPT-4V (Generative Pre-trained Transformer 4 with Vision), a multimodal AI model that can process both text and images. While the study was small, it highlights the potential of multimodal AI to assist physicians in medical decision-making, although more research is needed to compare such models' diagnostic abilities with those of human physicians.
Co-authors of the study included collaborators from NIH's National Eye Institute and the NIH Clinical Center; the University of Pittsburgh; UT Southwestern Medical Center, Dallas; New York University Grossman School of Medicine, New York City; Harvard Medical School and Massachusetts General Hospital, Boston; Case Western Reserve University School of Medicine, Cleveland; University of California San Diego, La Jolla; and the University of Arkansas, Little Rock.