The Robustness principle refers to the ability of a medical AI model to maintain its performance and accuracy when applied under the highly variable conditions of the real world, outside the controlled laboratory environment in which the algorithm was built. Variation is an integral part of real-world radiology, and given the differences in clinical practice between radiology departments, both within and across centres and countries, preventive and corrective measures should be implemented to make AI algorithms robust to changing clinical conditions. The robustness of a model is hence defined by its capability to generalise and predict well even in the presence of variable conditions that cause domain and dataset shifts, whether or not these shifts were anticipated before deployment. To assess and achieve robustness of medical AI algorithms, we propose the following recommendations:
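A dataset shift of the kind described above can often be surfaced with a simple distribution check on an input feature measured at training time and at deployment. The sketch below is a minimal, hypothetical illustration (the feature, values and threshold are invented for the example): it computes a standardised mean difference between training and live data, where values near 0 suggest similar distributions and larger values warrant investigation.

```python
import statistics

def shift_score(train_values, live_values):
    """Standardised difference between training and deployment feature means.

    A simple covariate-shift indicator: |mean difference| divided by the
    pooled standard deviation of the two samples.
    """
    mu_t, mu_l = statistics.mean(train_values), statistics.mean(live_values)
    sd_t, sd_l = statistics.stdev(train_values), statistics.stdev(live_values)
    pooled = ((sd_t ** 2 + sd_l ** 2) / 2) ** 0.5
    return abs(mu_t - mu_l) / pooled if pooled else 0.0

# Hypothetical example: mean liver attenuation (HU) measured on CT scans
# from the training set versus a new deployment site.
train_hu = [55, 58, 60, 54, 57, 59, 56, 61]
live_hu  = [45, 47, 44, 48, 46, 43, 47, 45]   # systematically lower: possible shift
print(shift_score(train_hu, live_hu))
```

In practice such monitoring would run continuously on incoming data, and a sustained high score would trigger a review rather than an automatic model change.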

  1. Application-specific robustness: Definitions and requirements for robustness should be compiled for each AI application. At the design phase, the development teams should analyse the factors that may impact the tool’s robustness in real-world practice (e.g. differences in imaging scanners across centres) and accordingly define mitigation measures.
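Such a design-phase analysis can be recorded as a machine-readable specification that pairs each anticipated variation with its mitigation measure. The sketch below is purely illustrative; the application, factors and mitigations are hypothetical examples, not a prescribed schema:

```python
# Hypothetical robustness specification for an imaginary chest-CT nodule detector.
# Each anticipated real-world variation is paired with a mitigation measure.
ROBUSTNESS_SPEC = {
    "application": "chest-CT nodule detection",
    "factors": [
        {"factor": "scanner vendor differs across centres",
         "mitigation": "train on multi-vendor data; test per vendor"},
        {"factor": "slice thickness varies across protocols",
         "mitigation": "resample to a fixed spacing; augment thickness"},
        {"factor": "contrast vs non-contrast acquisitions",
         "mitigation": "include both protocols in the training data"},
    ],
}

for item in ROBUSTNESS_SPEC["factors"]:
    print(f'{item["factor"]} -> {item["mitigation"]}')
```

Keeping the specification explicit makes it auditable and lets the robustness tests of recommendation 3 be derived directly from the listed factors.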
  2. Real-world relevance: AI models should be trained on representative real-world data, as defined by the domain experts.
  3. Robustness tests: AI tools should be tested for robustness against real-world variations. The performance of the AI tool should be tested under varying conditions, such as variations in equipment, operators or centres.
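A basic form of such a test is to stratify the evaluation results by the acquisition condition and flag any stratum whose performance falls below a chosen threshold. The following sketch uses invented toy data and a hypothetical 0.8 accuracy threshold:

```python
def accuracy_by_group(records):
    """Accuracy of an AI tool stratified by an acquisition condition.

    `records` is a list of (group, correct) pairs, where `group` identifies
    e.g. the scanner, operator or centre, and `correct` is a boolean.
    """
    totals, hits = {}, {}
    for group, correct in records:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(correct)
    return {g: hits[g] / totals[g] for g in totals}

# Toy evaluation in which the tool underperforms on scanner B.
results = [("scanner_A", True)] * 9 + [("scanner_A", False)] * 1 \
        + [("scanner_B", True)] * 6 + [("scanner_B", False)] * 4
per_group = accuracy_by_group(results)
flagged = [g for g, acc in per_group.items() if acc < 0.8]
print(per_group, flagged)   # scanner_B falls below the threshold
```

An aggregate accuracy over both scanners would look acceptable here; only the stratified view reveals that the tool is not robust to the change of scanner.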
  4. Robustness enhancement: AI models should be enhanced with concrete mechanisms to increase their robustness whenever needed. For example, the robustness of AI models can be improved using data augmentation, domain adaptation, transfer learning and/or domain distillation, depending on the AI application and the specific limitations.
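Of the mechanisms listed, data augmentation is the simplest to sketch. The toy example below, assuming a greyscale image represented as a list of rows, applies a random global intensity scaling and additive Gaussian noise, two common perturbations used to make models less sensitive to scanner- and protocol-dependent intensity differences; the parameter ranges are illustrative, not recommended values:

```python
import random

def augment_intensity(image, seed=None):
    """Simple intensity augmentation for a 2D greyscale image (list of rows)."""
    rng = random.Random(seed)
    scale = rng.uniform(0.9, 1.1)          # simulate calibration differences
    return [[px * scale + rng.gauss(0, 2)  # simulate acquisition noise
             for px in row] for row in image]

image = [[100.0, 120.0], [110.0, 130.0]]
augmented = augment_intensity(image, seed=0)
```

During training, a fresh augmented copy would be generated for each epoch so that the model never sees exactly the same intensities twice.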
  5. Human oversight: AI tools should integrate mechanisms for human oversight and feedback. This increases robustness by providing mechanisms for end-users to flag and correct errors. For each application, the stages at which human judgement is deemed desirable should be defined, e.g. quality control, data annotations, accuracy checks, AI explainability or feedback about identified errors.
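A minimal version of such a flag-and-correct mechanism can be sketched as a feedback log that records suspected errors and the reviewer's correction. The class and field names below are hypothetical, chosen only for the illustration:

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackLog:
    """Minimal end-user feedback mechanism: flag suspected AI errors for review."""
    flags: list = field(default_factory=list)

    def flag(self, case_id, prediction, reason):
        """Record a suspected error reported by the end-user."""
        self.flags.append({"case": case_id, "prediction": prediction,
                           "reason": reason, "resolved": False})

    def resolve(self, case_id, corrected_label):
        """Record the reviewer's correction for a flagged case."""
        for f in self.flags:
            if f["case"] == case_id and not f["resolved"]:
                f["resolved"] = True
                f["correction"] = corrected_label

log = FeedbackLog()
log.flag("case-042", "nodule", "likely false positive near vessel")
log.resolve("case-042", "no nodule")
```

The resolved flags form a curated set of corrected cases that can feed back into the robustness tests and retraining data of the earlier recommendations.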