Many areas of science have been facing a reproducibility crisis over the past two years, and machine learning and AI are no exception. That has been highlighted by recent efforts to identify papers with results that are reproducible and those that are not.
Two new analyses put the spotlight on machine learning in health research, where lack of reproducibility and poor quality are especially alarming. “If a doctor is using machine learning or an artificial intelligence tool to aid in patient care, and that tool does not perform up to the standards reported during the research process, then that could risk harm to the patient, and it could generally lower the quality of care,” says Marzyeh Ghassemi of the University of Toronto.
In a paper analyzing 511 other papers, Ghassemi’s team reported that machine learning papers in healthcare were reproducible far less often than those in other machine learning subfields. The group’s findings were published this week in the journal Science Translational Medicine. And in a systematic review published in Nature Machine Intelligence, 85 percent of studies using machine learning to detect COVID-19 in chest scans failed a reproducibility and quality check, and none of the models was near ready for use in clinics, the authors say.
“We were surprised at how far the models are from being ready for deployment,” says Derek Driggs, co-author of the paper from the lab of Carola-Bibiane Schönlieb at the University of Cambridge. “There were many flaws that should not have existed.”
When the pandemic began, Schönlieb and colleagues formed a multidisciplinary team, the AIX-COVNET collaboration, to develop a model using chest X-rays to predict COVID-19 severity. Yet, following a literature review, the team found that many models appeared to include biases that would make them unfit for the clinic.
So instead of building their own model, the team dove deeper into the literature. “We realized the best way to help would be by setting rigid research standards that could help people develop models that could actually be useful to clinicians,” says Driggs. To determine what standards were needed, the team collected 2,212 machine learning studies and winnowed them down to 415 models for detecting or predicting COVID-19 infection from chest scans.
Of those 415, only 62 passed two standard reproducibility and quality checklists, CLAIM and RQS. “Many studies didn’t actually report enough of their methodology for their models to be recreated,” says Driggs. “This is a huge reproducibility issue.”
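As a quick arithmetic check, these counts line up with the 85 percent failure rate quoted for the review. The short Python sketch below uses only the two figures reported in the article (415 models screened, 62 passing both checklists); the variable names are ours.

```python
# Figures reported for the review: 415 models screened, 62 passed both checklists.
screened = 415
passed = 62

failed = screened - passed
pass_rate = passed / screened
fail_rate = failed / screened

print(f"passed: {passed} ({pass_rate:.1%})")   # passed: 62 (14.9%)
print(f"failed: {failed} ({fail_rate:.1%})")   # failed: 353 (85.1%)
```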
Of the remaining 62—including two currently in use in clinics—the team found that none were developed such that they could actually be deployed in medicine. Key issues were biases in study design and methodological flaws.
For example, 16 of the 62 studies used a dataset of images of children’s lungs as the healthy control, without mentioning it in the methodology, then tested the algorithms on images from adults with COVID-19. That essentially trained the models to tell children from adults rather than healthy patients from infected ones. Additionally, some models were trained on datasets too small to be effective, and some studies did not specify where their data came from.
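To see why a control group of children is such a problem, consider a deliberately simplified simulation. This is a toy sketch on synthetic numbers, not the method of any study in the review: each "scan" is reduced to a single hypothetical feature (a lung-size proxy) that depends only on age, and the "model" is a one-threshold classifier. All function names and values are assumptions made for illustration.

```python
import random

def make_scan(age_group, infected, rng):
    """Return (feature, label). The feature is a lung-size proxy driven by age only;
    infection leaves no trace in it (an assumption of this toy setup)."""
    base = 0.3 if age_group == "child" else 1.0
    return base + rng.gauss(0, 0.1), infected

def accuracy(threshold, samples):
    """Fraction of samples where 'feature >= threshold' matches the infected label."""
    return sum((x >= threshold) == y for x, y in samples) / len(samples)

def fit_stump(samples):
    """Pick the threshold (among observed feature values) with the best training accuracy."""
    best_t, best_acc = None, -1.0
    for t, _ in samples:
        acc = accuracy(t, samples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

rng = random.Random(0)

# Confounded training data: every healthy control is a child, every COVID-19 case an adult.
train = [make_scan("child", False, rng) for _ in range(200)] + \
        [make_scan("adult", True, rng) for _ in range(200)]

# External data: adults only, half healthy and half infected.
external = [make_scan("adult", False, rng) for _ in range(200)] + \
           [make_scan("adult", True, rng) for _ in range(200)]

threshold = fit_stump(train)
train_acc = accuracy(threshold, train)
external_acc = accuracy(threshold, external)

print(f"accuracy on confounded data: {train_acc:.2f}")
print(f"accuracy on adults-only data: {external_acc:.2f}")
```

In this setup the classifier looks nearly perfect on the confounded data because age alone separates the two groups, yet it does little better than a coin flip once every patient is an adult.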
At the University of Toronto, Ghassemi and colleagues evaluated 511 machine learning papers presented at machine learning conferences from 2017 to 2019. By hand, her team annotated each paper against a set of criteria for different types of reproducibility. In technical reproducibility, meaning the ability to fully replicate a paper’s results by running the authors’ code on the same dataset, only 55 percent of machine learning in healthcare (MLH) papers made their code available and used public datasets, compared with 90 percent of computer vision and natural language processing papers.
“What’s worrying is that the datasets are not available,” says Ghassemi. “I did not realize quite how bad it was until we read through all the papers.”
In conceptual reproducibility—the ability to reproduce results with a different dataset—only 23 percent of MLH papers used multiple datasets to confirm their results, as compared with 80 percent of computer vision studies and 58 percent of natural language processing studies.
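The two criteria described here can be expressed as simple checks over per-paper annotations. The sketch below is an assumption about how such a tally might look, not the team's actual annotation scheme: the `PaperAnnotation` fields are hypothetical, and the four records are made-up placeholders rather than data from the study.

```python
from dataclasses import dataclass

@dataclass
class PaperAnnotation:
    # Hypothetical fields; the article does not give the team's real annotation schema.
    code_available: bool
    public_dataset: bool
    num_datasets: int

def technically_reproducible(p: PaperAnnotation) -> bool:
    """Code is released and the dataset is public, so results can be rerun as published."""
    return p.code_available and p.public_dataset

def conceptually_reproducible(p: PaperAnnotation) -> bool:
    """Results were confirmed on more than one dataset."""
    return p.num_datasets >= 2

def share(papers, criterion) -> float:
    """Fraction of papers that satisfy the given criterion."""
    return sum(criterion(p) for p in papers) / len(papers)

# Made-up placeholder annotations, for illustration only.
papers = [
    PaperAnnotation(code_available=True, public_dataset=True, num_datasets=3),
    PaperAnnotation(code_available=True, public_dataset=False, num_datasets=1),
    PaperAnnotation(code_available=False, public_dataset=True, num_datasets=2),
    PaperAnnotation(code_available=False, public_dataset=False, num_datasets=1),
]

technical_share = share(papers, technically_reproducible)
conceptual_share = share(papers, conceptually_reproducible)

print(f"technical: {technical_share:.0%}, conceptual: {conceptual_share:.0%}")
```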
Healthcare is an especially challenging area for machine learning research because many datasets are restricted due to health privacy concerns, and even experts may disagree on a diagnosis for a scan or patient. Still, researchers are optimistic that the field can do better.
“A lot of the issues we identified can easily be fixed,” says Driggs. Here are a few of their recommendations for doing so:
Form a multidisciplinary team: “There’s a disconnect in research standards between medical and machine learning communities,” says Driggs. While it is common to split up a single dataset into training and testing sets in machine learning, medical communities expect models to be validated on external datasets. Building a team of machine learning researchers and clinicians can help bridge that gap and ensure a model is actually useful for doctors.
Make sure you use high-quality data, and know its origins: This key element will fix many issues highlighted in the studies, says Driggs. Various teams, including Ghassemi’s, are developing such datasets for use, and some already exist, such as the Medical Information Mart for Intensive Care and the eICU Collaborative Research Database. “If we can create data that is diverse and representative, and allow it to be used in machine learning for health community…that’s going to be very powerful,” says Ghassemi.
Develop community standards: “Organizations that run conferences should put standards into place” that require rigorous and consistent data and reporting, says Ghassemi. Health organizations have data standards such as the Observational Medical Outcomes Partnership standard and the Fast Healthcare Interoperability Resources standard, but these are not yet commonly adopted in MLH research.