Robustness

The Robustness principle refers to the ability of a medical AI model to maintain its performance and accuracy when applied under the highly variable conditions of the real world, outside the controlled laboratory environment in which the algorithm was built. Medical variations are an integral part of real-world radiology, and given the differences in clinical practice between radiology departments within and across centres and countries, preventive and corrective measures are needed to make AI algorithms robust to changing clinical conditions. The robustness of a model is therefore defined by its capability to generalise and predict well even under variable conditions that cause domain and dataset shifts, whether or not these were anticipated before deployment. To assess and achieve robustness of medical AI algorithms, we propose the following recommendations:

  1. Data heterogeneity: The AI algorithms should be trained on heterogeneous datasets from multiple devices, protocols and clinical centres (e.g. using federated learning) to increase robustness against heterogeneous data distributions.
  2. Data augmentation: Data augmentation techniques (e.g. spatial augmentation, intensity-based augmentation, image synthesis, generative learning) should be used to enhance the training and generalisability of the AI models.
  3. Image harmonisation: Image harmonisation techniques (e.g. histogram normalisation, adversarial learning, bias field correction, image resampling) should be considered to minimise the effects of image variations across centres, scanners and protocols.
  4. Feature variability: For feature-based medical AI, feature reproducibility and harmonisation techniques (e.g. ComBat) should be investigated to assess, minimise and report feature variability.
  5. Anatomical variability: The AI developers should carefully verify that the training and evaluation datasets have adequate anatomical variability, for example in body/organ size, amount of body fat, tissue density, and position in the scanner.
  6. Operator variability: The evaluation studies should investigate the robustness of the AI model against intra- and inter-operator variations in the image scans (e.g. ultrasound) or in the clinical annotations (e.g. lesion delineations), for example due to differences in operator experience.
  7. Quality control: Techniques for image and segmentation quality control should be an integral part of the AI workflow, to identify low-quality images (e.g. with artifacts and/or inhomogeneities) or potential segmentation errors.
  8. Human-in-the-loop: Human-in-the-loop mechanisms should be implemented as part of the AI technology to ensure that contextual changes and AI errors can be flagged and used to fine-tune the AI models.
  9. Uncertainty estimation: Medical AI models should be developed together with uncertainty estimates to provide confidence scores to the clinicians and flag cases with reduced robustness.
  10. Low-resource settings: When possible, the AI models should be evaluated and fine-tuned with clinical datasets from resource-limited sites and countries, to verify robustness and transferability across different settings.
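To make recommendation 2 concrete, the following is a minimal sketch of spatial and intensity-based augmentation on a 2D image using only NumPy. The specific transforms, parameter ranges, and the `augment` function itself are illustrative assumptions; production pipelines would typically use dedicated libraries with far richer transforms.

```python
import numpy as np

def augment(image, rng):
    """Apply simple spatial and intensity-based augmentations to a 2D image.

    A minimal sketch of the idea; the transforms and ranges here are
    illustrative, not a recommended recipe.
    """
    # Spatial augmentation: random flips and a random 90-degree rotation.
    if rng.random() < 0.5:
        image = np.flip(image, axis=0)
    if rng.random() < 0.5:
        image = np.flip(image, axis=1)
    image = np.rot90(image, k=int(rng.integers(0, 4)))

    # Intensity-based augmentation: random gain, bias and Gaussian noise,
    # mimicking scanner-to-scanner intensity differences.
    gain = rng.uniform(0.9, 1.1)
    bias = rng.uniform(-0.05, 0.05)
    noise = rng.normal(0.0, 0.01, size=image.shape)
    return gain * image + bias + noise

rng = np.random.default_rng(42)
scan = rng.random((64, 64))        # stand-in for a normalised image slice
augmented = augment(scan, rng)
```

Applying such randomised transforms at training time exposes the model to plausible acquisition variability without collecting additional scans.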
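For the histogram-normalisation technique named in recommendation 3, a classic approach is histogram matching: mapping a scan's intensities onto the empirical distribution of a reference scan. The sketch below, including the assumed `match_histogram` helper, implements this via empirical CDFs in NumPy; real harmonisation pipelines would add masking, landmark-based normalisation, or learned methods.

```python
import numpy as np

def match_histogram(source, reference):
    """Map source intensities onto the reference intensity distribution
    via the empirical CDFs (classic histogram matching)."""
    src_flat = source.ravel()
    src_values, src_counts = np.unique(src_flat, return_counts=True)
    src_cdf = np.cumsum(src_counts) / src_flat.size

    ref_flat = reference.ravel()
    ref_values, ref_counts = np.unique(ref_flat, return_counts=True)
    ref_cdf = np.cumsum(ref_counts) / ref_flat.size

    # For each source intensity, look up the reference intensity that sits
    # at the same cumulative-probability level.
    mapped_values = np.interp(src_cdf, ref_cdf, ref_values)
    return np.interp(src_flat, src_values, mapped_values).reshape(source.shape)

rng = np.random.default_rng(0)
scan_a = rng.normal(100.0, 20.0, size=(32, 32))   # e.g. intensities from scanner A
scan_b = rng.normal(140.0, 10.0, size=(32, 32))   # e.g. intensities from scanner B
harmonised = match_histogram(scan_a, scan_b)      # scan A, restated in B's range
```

After matching, the harmonised scan follows the reference scanner's intensity distribution, reducing one obvious source of cross-scanner shift.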
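Recommendation 4 cites ComBat for feature harmonisation. The sketch below shows only a location-scale simplification of that idea, aligning each site's per-feature mean and standard deviation to a reference site; real ComBat additionally uses empirical-Bayes shrinkage and covariate adjustment, which this deliberately omits. The `harmonise_features` function and the two-site toy data are assumptions for illustration.

```python
import numpy as np

def harmonise_features(features, sites, reference_site):
    """Align per-site feature distributions to a reference site by matching
    mean and standard deviation (a location-scale simplification of ComBat)."""
    features = features.astype(float).copy()
    ref_mask = sites == reference_site
    ref_mean = features[ref_mask].mean(axis=0)
    ref_std = features[ref_mask].std(axis=0)
    for site in np.unique(sites):
        mask = sites == site
        site_mean = features[mask].mean(axis=0)
        site_std = features[mask].std(axis=0)
        # Standardise within the site, then rescale to the reference site.
        features[mask] = (features[mask] - site_mean) / site_std * ref_std + ref_mean
    return features

rng = np.random.default_rng(1)
# Radiomic-style features from two sites with a systematic scanner offset.
site_a = rng.normal(0.0, 1.0, size=(50, 4))
site_b = rng.normal(3.0, 2.0, size=(50, 4))
feats = np.vstack([site_a, site_b])
sites = np.array(["A"] * 50 + ["B"] * 50)
harmonised = harmonise_features(feats, sites, reference_site="A")
```

Quantifying feature variability before and after such a step is one way to "assess, minimise and report" it, as the recommendation asks.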
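Finally, the uncertainty estimates of recommendation 9 can be obtained in several ways (e.g. deep ensembles, Monte Carlo dropout). A minimal ensemble-based sketch, with `ensemble_predict`, the toy models, and the review threshold all being illustrative assumptions, is:

```python
import numpy as np

def ensemble_predict(models, x):
    """Combine an ensemble's class-probability predictions into a consensus
    prediction plus an uncertainty score (disagreement across members)."""
    preds = np.stack([m(x) for m in models])   # shape: (n_models, n_classes)
    mean_prob = preds.mean(axis=0)             # consensus probabilities
    uncertainty = preds.std(axis=0).max()      # largest cross-member spread
    return mean_prob, uncertainty

# Toy ensemble: three "models" that largely agree on class 1.
models = [
    lambda x: np.array([0.10, 0.90]),
    lambda x: np.array([0.20, 0.80]),
    lambda x: np.array([0.15, 0.85]),
]
mean_prob, uncertainty = ensemble_predict(models, x=None)

# A deployment rule might route high-uncertainty cases to a clinician
# instead of returning the prediction unreviewed.
flag_for_review = uncertainty > 0.2
```

Exposing `uncertainty` alongside the prediction gives clinicians the confidence score the recommendation calls for, and flagged cases feed naturally into the human-in-the-loop mechanism of recommendation 8.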