A study suggests that artificial intelligence tools used to interpret diagnostic images may need a wider range of images to train on than previously thought.
The research, conducted at the Icahn School of Medicine at the Mount Sinai Health System, found that AI tools trained on only one organization’s images may suffer declines in performance and accuracy when tested on data from other healthcare organizations.
Such a finding may throw a curve in the push to use AI more broadly in radiology and other medical disciplines, suggesting that a wider knowledge base might be needed to ensure tools are sufficiently trained to work on data from a range of organizations.
Results of the test conducted at the school were published this week in a special issue of PLOS Medicine on machine learning in healthcare. Researchers say the findings suggest that artificial intelligence in the medical space must be carefully tested for performance across a wide range of populations—otherwise, the deep learning models may not generalize to new data.
The study focused on AI tools used to detect pneumonia on chest X-rays; the researchers assessed how AI models identified pneumonia in 158,000 chest X-rays taken across three medical institutions—the National Institutes of Health, The Mount Sinai Hospital and Indiana University Hospital. Researchers said they selected pneumonia diagnosed through chest X-rays because of its common occurrence, clinical significance and prevalence.
In three out of five comparisons, convolutional neural networks performed significantly worse at diagnosing disease on X-rays from hospitals outside their own network than on X-rays from the original health system. For example, models trained on images acquired at Mount Sinai were less accurate when they were used at one of the other organizations.
The researchers also found that the convolutional neural networks (CNNs) were able to identify the hospital system where an X-ray was acquired with a high degree of accuracy, and that they “cheated” at their predictive task by relying on the prevalence of pneumonia at the training institution.
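The shortcut described above can be illustrated with a small toy calculation. The sketch below is not the researchers' method or data: the two hospitals, their X-ray counts and their pneumonia rates are made-up assumptions, and the "model" is a hypothetical one that never looks at the image and simply scores each X-ray with the pneumonia prevalence of the hospital it came from.

```python
# Toy illustration with made-up counts; not data from the study.
# Each hypothetical site maps to (number of X-rays, number showing pneumonia).
SITES = {
    "hospital_a": (1000, 300),  # high-prevalence site (assumed 30%)
    "hospital_b": (1000, 10),   # low-prevalence site (assumed 1%)
}


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def site_only_predictions(n_total, n_positive):
    """Labels for one site, plus a 'model' that ignores the image entirely
    and scores every X-ray with the site's pneumonia prevalence."""
    labels = [1] * n_positive + [0] * (n_total - n_positive)
    scores = [n_positive / n_total] * n_total
    return labels, scores


pooled_labels, pooled_scores = [], []
per_site_auc = {}
for name, (n_total, n_positive) in SITES.items():
    labels, scores = site_only_predictions(n_total, n_positive)
    per_site_auc[name] = auc(labels, scores)  # performance inside one site
    pooled_labels += labels
    pooled_scores += scores

# Performance when both sites are mixed into a single test set.
pooled_auc = auc(pooled_labels, pooled_scores)

print(f"pooled AUC: {pooled_auc:.3f}")
for name, value in per_site_auc.items():
    print(f"{name} AUC: {value:.3f}")
```

With these assumed counts, knowing only the hospital yields a pooled AUC of about 0.78, yet the same score is no better than chance (0.5) inside either hospital. A network that picks up on where an image was taken can therefore look accurate on mixed data while having learned little about pneumonia itself.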
The researchers noted that one difficulty in using deep learning models in medicine is their massive number of parameters, which makes it challenging to identify the specific variables driving predictions, such as the type of CT scanner used at a hospital or the resolution quality of the imaging.
“Our findings should give pause to those considering rapid deployment of artificial intelligence platforms without rigorously assessing their performance in real-world clinical settings reflective of where they are being deployed,” says senior author Eric Oermann, MD, instructor in neurosurgery at the Icahn School of Medicine at Mount Sinai. “Deep learning models trained to perform medical diagnosis can generalize well, but this cannot be taken for granted [because] patient populations and imaging techniques differ significantly across institutions.”
“If CNN systems are to be used for medical diagnosis, they must be tailored to carefully consider clinical questions, tested for a variety of real-world scenarios and carefully assessed to determine how they impact accurate diagnosis,” says first author John Zech, a medical student at the Icahn School of Medicine at Mount Sinai.