Define adequate evaluation plan (general 4)
Identify the dimensions of trustworthy AI to be evaluated |
Robustness, clinical safety, fairness, data drifts, usability, explainability |
Select appropriate testing datasets |
External dataset from a new hospital, public benchmarking dataset |
Compare the AI tool against standard of care |
Conventional risk predictors, visual assessment by radiologist, decision by clinician |
Select adequate evaluation metrics |
F1 score for classification, concordance index for survival, statistical parity for fairness |
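A minimal sketch of the first two example metrics, using illustrative arrays rather than real model outputs (statistical parity is sketched under fairness 3 below):

```python
# Illustrative metric computation; all arrays are placeholder values.
import numpy as np
from sklearn.metrics import f1_score            # classification
from lifelines.utils import concordance_index   # survival analysis

# Classification: F1 score on binary labels vs model predictions.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
print("F1 score:", f1_score(y_true, y_pred))

# Survival: concordance index. Higher predicted scores should mean longer
# survival, so risk scores are negated before being passed in.
follow_up_years = np.array([2.0, 5.1, 1.2, 8.4, 3.3])
event_observed  = np.array([1, 0, 1, 0, 1])           # 1 = event occurred
predicted_risk  = np.array([0.8, 0.2, 0.9, 0.1, 0.5])
print("C-index:", concordance_index(follow_up_years, -predicted_risk, event_observed))
```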
Evaluate using external datasets and/or multiple sites (universality 3) |
Identify relevant public datasets |
Cancer Imaging Archive, UK Biobank, M&Ms, MAMA-MIA, BraTS
Identify external private datasets |
New prospective dataset from same site or from different clinical centre |
Select multiple evaluation sites |
Three sites in same country, five sites in two different countries |
Verify that evaluation data and sites reflect real world variations |
Variations in demographics, clinicians, equipment |
Confirm that no evaluation data were used during training |
Yes/no |
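A minimal sketch of such a check at the patient level, assuming identifier lists are available for both splits (the IDs below are placeholders):

```python
# Leakage check: confirm no evaluation subjects were seen during training.
# In practice the ID sets would be read from the data manifests.
train_ids = {"P001", "P002", "P003", "P004"}
eval_ids  = {"P101", "P102", "P103"}

overlap = train_ids & eval_ids
assert not overlap, f"{len(overlap)} patients appear in both training and evaluation data"
print("No patient-level overlap between training and evaluation data")
```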
Evaluate fairness and bias correction measures (fairness 3) |
Select attributes and factors for fairness evaluation |
Sex, age, skin colour, comorbidity |
Define fairness metrics and criteria |
Statistical parity difference, with fairness defined as values between −0.1 and 0.1
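A minimal sketch of this criterion on illustrative predictions, where the protected attribute is sex:

```python
# Statistical parity difference (SPD) with the ±0.1 fairness criterion above.
# Arrays are placeholders; 'group' encodes the protected attribute.
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group  = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])

rate_f = y_pred[group == "F"].mean()   # positive prediction rate, group F
rate_m = y_pred[group == "M"].mean()   # positive prediction rate, group M
spd = rate_f - rate_m

print(f"SPD = {spd:+.2f}", "-> fair" if -0.1 <= spd <= 0.1 else "-> biased")
```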
Evaluate fairness and identify biases |
Fair with respect to age, biased with respect to sex |
Evaluate bias mitigation measures |
Training data resampling, equalised odds postprocessing |
Evaluate impact of mitigation measures on model performance |
Data resampling removed sex bias but reduced model performance |
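A minimal sketch of the resampling option on synthetic placeholder data: the smaller sex group is upsampled in the training set, and F1 and statistical parity difference are compared before and after. The dataset, features, and model are illustrative stand-ins, not the tool under evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def make_split(n):
    # Synthetic cohort with an imbalanced protected attribute (sex).
    sex = rng.choice(["F", "M"], size=n, p=[0.3, 0.7])
    x = rng.normal(size=(n, 3)) + (sex == "F")[:, None] * 0.5
    y = (x.sum(axis=1) + rng.normal(scale=0.5, size=n) > 0.75).astype(int)
    return pd.DataFrame(x, columns=["x1", "x2", "x3"]).assign(sex=sex, label=y)

train, test = make_split(600), make_split(300)
features = ["x1", "x2", "x3"]

def spd(pred, sex):
    # Statistical parity difference between sex groups.
    return pred[sex == "F"].mean() - pred[sex == "M"].mean()

def fit_and_score(train_df):
    model = LogisticRegression(max_iter=1000).fit(train_df[features], train_df["label"])
    pred = model.predict(test[features])
    return f1_score(test["label"], pred), spd(pred, test["sex"].to_numpy())

# Mitigation: upsample the smaller sex group in the training data only.
n_max = train["sex"].value_counts().max()
resampled = train.groupby("sex", group_keys=False).sample(n=n_max, replace=True, random_state=0)

for name, df in [("baseline", train), ("resampled", resampled)]:
    f1, parity = fit_and_score(df)
    print(f"{name}: F1 = {f1:.2f}, SPD = {parity:+.2f}")
```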
Report identified and uncorrected biases |
In AI information leaflet and technical documentation |
Evaluate user experience (usability 4) |
Evaluate usability with diverse end users |
According to sex, age, digital proficiency level, role, clinical profile |
Evaluate user satisfaction using usability questionnaires |
System usability scale |
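A short sketch of SUS scoring with one respondent's illustrative ratings: odd items contribute (rating − 1), even items contribute (5 − rating), and the sum is scaled by 2.5 to a 0-100 score.

```python
def sus_score(ratings):
    # Standard System Usability Scale scoring for ten items rated 1-5.
    assert len(ratings) == 10 and all(1 <= r <= 5 for r in ratings)
    contributions = [(r - 1) if i % 2 == 0 else (5 - r)   # items 1, 3, 5, ... sit at even indices
                     for i, r in enumerate(ratings)]
    return 2.5 * sum(contributions)

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))   # one user's illustrative responses
```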
Evaluate user performance and productivity |
Diagnosis time with and without AI tool, image quantification time |
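One way to summarise such timing comparisons, assuming paired measurements per reader (the values below are placeholders, in minutes):

```python
# Paired comparison of diagnosis time for the same readers with and without the AI tool.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

time_without_ai = np.array([12.5, 9.8, 14.2, 11.0, 13.6, 10.4])
time_with_ai    = np.array([ 9.1, 8.2, 10.5,  9.9, 11.0,  8.7])

print("Mean reduction (min):", (time_without_ai - time_with_ai).mean())
print("Paired t-test:", ttest_rel(time_without_ai, time_with_ai))
print("Wilcoxon signed-rank:", wilcoxon(time_without_ai, time_with_ai))
```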
Assess training of new end users |
Average time to reach competency, training difficulties |
Evaluate clinical utility and safety (usability 5) |
Define clinical evaluation plan |
Randomised controlled trial, in silico trial
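For the randomised controlled trial option, a rough sample size sketch under assumed design parameters (standardised effect size 0.3, 5% significance, 80% power), all of which are illustrative:

```python
# Approximate per-arm sample size for a two-arm trial under assumed parameters.
from statsmodels.stats.power import TTestIndPower

n_per_arm = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"Approximately {round(n_per_arm)} participants per arm")
```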
Evaluate if AI tool improves patient outcomes |
Better risk prevention, earlier diagnosis, more personalised treatment |
Evaluate if AI tool enhances productivity or quality of care |
Enhanced patient triage, shorter waiting times, faster diagnosis, higher patient intake |
Evaluate if AI tool results in cost savings |
Reduction in diagnosis costs, reduction in overtreatment |
Evaluate AI tool’s safety |
Side effects or major adverse events in randomised controlled trials
Evaluate robustness (robustness 3) |
Evaluate robustness under real world variations |
Using test-retest datasets, multivendor datasets |
Evaluate robustness under simulated variations |
Using simulated repeatability tests, synthetic noise and artefacts (eg, image blurring) |
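A minimal sketch of a simulated robustness test; the image batch and the predict function are placeholders standing in for the AI tool and its inference call:

```python
# Compare predictions on original images against blurred and noise-corrupted copies.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
images = rng.random((20, 128, 128))                   # illustrative image batch

def predict(batch):                                   # placeholder model
    return (batch.mean(axis=(1, 2)) > 0.5).astype(int)

baseline = predict(images)
perturbations = {
    "blur":  np.stack([gaussian_filter(im, sigma=1.5) for im in images]),
    "noise": images + rng.normal(scale=0.05, size=images.shape),
}
for name, batch in perturbations.items():
    agreement = (predict(batch) == baseline).mean()   # or Dice/F1 against reference labels
    print(f"{name}: agreement with baseline = {agreement:.2f}")
```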
Evaluate robustness against variations in end users |
Different technicians or annotators |
Evaluate mitigation measures for robustness enhancement |
Regularisation, data augmentation, noise addition, normalisation, resampling, domain adaptation |
Evaluate explainability (explainability 2) |
Assess if explanations are clinically meaningful |
Review by expert panels, alignment with current clinical guidelines, explanations that do not point to shortcuts
Assess explainability quantitatively using objective measures |
Fidelity, consistency, completeness, sensitivity to noise |
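A minimal sketch of one such measure, sensitivity of explanations to input noise, using a placeholder attribution function rather than a real saliency method; the same pattern covers the perturbation stress tests listed further below:

```python
# Correlate the explanation of the original input with that of a slightly perturbed input.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((128, 128))                  # illustrative input

def explain(x):                                 # placeholder attribution method
    return np.gradient(x)[0]                    # stands in for a saliency map

base_map  = explain(image)
noisy_map = explain(image + rng.normal(scale=0.01, size=image.shape))

stability = np.corrcoef(base_map.ravel(), noisy_map.ravel())[0, 1]
print(f"Explanation stability under noise: {stability:.2f}")
```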
Assess explainability qualitatively with end users |
Using user tests or questionnaires to measure confidence and the effect of explanations on clinical decision making
Evaluate if explanations cause end user overconfidence or overreliance |
Measure changes in clinician confidence and performance with and without the AI tool
Evaluate if explanations are sensitive to input data variations |
Stress tests under perturbations to evaluate the stability of explanations |
Provide documentation (traceability 2) |
Report evaluation results in publication using AI reporting guidelines |
Peer reviewed scientific publication using the TRIPOD+AI reporting guideline
Create technical documentation for AI tool |
AI passport, model cards (including model hyperparameters, training and testing data, evaluations, limitations, etc) |
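A minimal sketch of a model card serialised to JSON; every field name and value here is an illustrative placeholder, not a description of a real model:

```python
import json

model_card = {
    "model_name": "example-segmentation-model",
    "intended_use": "research evaluation only",
    "hyperparameters": {"learning_rate": 1e-4, "epochs": 100, "batch_size": 16},
    "training_data": "internal dataset, single site",
    "testing_data": "external multi-site dataset",
    "evaluation": {"f1_score": None, "c_index": None},   # filled in from the evaluations above
    "limitations": ["not validated for paediatric patients"],
}
print(json.dumps(model_card, indent=2))
```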
Create clinical documentation for AI tool |
Guidelines for clinical use, AI information leaflet (including intended use, conditions and diseases, targeted populations, instructions, potential benefits, contraindications) |
Provide risk management file |
Including identified risks, mitigation measures, monitoring measures |
Create user and training documentation |
User manuals, training materials, troubleshooting, FAQs (see usability 2) |
Identify and provide all locally required documentation |
Compliance documents and certifications (see general 5) |