Evaluation Phase

The evaluation phase of the FUTURE-AI framework comprises eight recommendations for the rigorous assessment of AI tools across multiple dimensions of trustworthiness. The table below lists practical operations and examples for each recommendation, covering Fairness (bias evaluation), Universality (external validation), Robustness (testing under real-world variations), Usability (user experience and clinical utility assessment), Explainability (verification of meaningful explanations), and Traceability (technical and clinical documentation). Illustrative code sketches for several of these operations follow the table.

Practical steps and examples to implement FUTURE-AI recommendations during the evaluation phase

| Recommendations | Operations | Examples |
| --- | --- | --- |
| Define adequate evaluation plan (general 4) | Identify the dimensions of trustworthy AI to be evaluated | Robustness, clinical safety, fairness, data drifts, usability, explainability |
| | Select appropriate testing datasets | External dataset from a new hospital, public benchmarking dataset |
| | Compare the AI tool against standard of care | Conventional risk predictors, visual assessment by radiologist, decision by clinician |
| | Select adequate evaluation metrics | F1 score for classification, concordance index for survival, statistical parity for fairness |
| Evaluate using external datasets and/or multiple sites (universality 3) | Identify relevant public datasets | Cancer Imaging Archive, UK Biobank, M&Ms, MAMA-MIA, BRATS |
| | Identify external private datasets | New prospective dataset from the same site or from a different clinical centre |
| | Select multiple evaluation sites | Three sites in the same country, five sites in two different countries |
| | Verify that evaluation data and sites reflect real world variations | Variations in demographics, clinicians, equipment |
| | Confirm that no evaluation data were used during training | Yes/no |
| Evaluate fairness and bias correction measures (fairness 3) | Select attributes and factors for fairness evaluation | Sex, age, skin colour, comorbidity |
| | Define fairness metrics and criteria | Fairness defined as statistical parity difference between −0.1 and 0.1 |
| | Evaluate fairness and identify biases | Fair with respect to age, biased with respect to sex |
| | Evaluate bias mitigation measures | Training data resampling, equalised odds post-processing |
| | Evaluate impact of mitigation measures on model performance | Data resampling removed sex bias but reduced model performance |
| | Report identified and uncorrected biases | In AI information leaflet and technical documentation |
| Evaluate user experience (usability 4) | Evaluate usability with diverse end users | According to sex, age, digital proficiency level, role, clinical profile |
| | Evaluate user satisfaction using usability questionnaires | System Usability Scale |
| | Evaluate user performance and productivity | Diagnosis time with and without AI tool, image quantification time |
| | Assess training of new end users | Average time to reach competency, training difficulties |
| Evaluate clinical utility and safety (usability 5) | Define clinical evaluation plan | Randomised controlled trial, in silico trial |
| | Evaluate if AI tool improves patient outcomes | Better risk prevention, earlier diagnosis, more personalised treatment |
| | Evaluate if AI tool enhances productivity or quality of care | Enhanced patient triage, shorter waiting times, faster diagnosis, higher patient intake |
| | Evaluate if AI tool results in cost savings | Reduction in diagnosis costs, reduction in overtreatment |
| | Evaluate AI tool's safety | Side effects or major adverse events in randomised controlled trials |
| Evaluate robustness (robustness 3) | Evaluate robustness under real world variations | Using test-retest datasets, multivendor datasets |
| | Evaluate robustness under simulated variations | Using simulated repeatability tests, synthetic noise and artefacts (eg, image blurring) |
| | Evaluate robustness against variations in end users | Different technicians or annotators |
| | Evaluate mitigation measures for robustness enhancement | Regularisation, data augmentation, noise addition, normalisation, resampling, domain adaptation |
| Evaluate explainability (explainability 2) | Assess if explanations are clinically meaningful | Review by expert panels, alignment with current clinical guidelines, explanations not pointing to shortcuts |
| | Assess explainability quantitatively using objective measures | Fidelity, consistency, completeness, sensitivity to noise |
| | Assess explainability qualitatively with end users | Using user tests or questionnaires to measure confidence and effect on clinical decision making |
| | Evaluate if explanations cause end user overconfidence or overreliance | Measure changes in clinician confidence and performance with and without AI tool |
| | Evaluate if explanations are sensitive to input data variations | Stress tests under perturbations to evaluate the stability of explanations |
| Provide documentation (traceability 2) | Report evaluation results in publications using AI reporting guidelines | Peer-reviewed scientific publication using TRIPOD-AI reporting guideline |
| | Create technical documentation for AI tool | AI passport, model cards (including model hyperparameters, training and testing data, evaluations, limitations, etc) |
| | Create clinical documentation for AI tool | Guidelines for clinical use, AI information leaflet (including intended use, conditions and diseases, targeted populations, instructions, potential benefits, contraindications) |
| | Provide risk management file | Including identified risks, mitigation measures, monitoring measures |
| | Create user and training documentation | User manuals, training materials, troubleshooting, FAQs (see usability 2) |
| | Identify and provide all locally required documentation | Compliance documents and certifications (see general 5) |
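
To make the "select adequate evaluation metrics" operation (general 4) concrete, the following minimal sketch computes the three example metrics from the table: F1 score, concordance index, and statistical parity difference. It assumes scikit-learn and lifelines are available; all labels, times, risk scores, and group assignments are toy values, not results from any real tool.

```python
# Sketch: computing the example evaluation metrics named under general 4.
import numpy as np
from sklearn.metrics import f1_score
from lifelines.utils import concordance_index

y_true = np.array([0, 1, 1, 0, 1, 0])   # toy ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1, 1])   # toy model predictions

# F1 score for a binary classification task
print("F1:", f1_score(y_true, y_pred))

# Concordance index for a survival model: higher risk should correspond
# to shorter survival, so the risk score is negated before scoring.
times = np.array([5.0, 2.0, 3.5, 8.0, 1.0, 6.0])   # follow-up times
events = np.array([1, 1, 0, 1, 1, 0])              # 1 = event observed
risk = np.array([0.2, 0.9, 0.6, 0.1, 0.8, 0.3])    # predicted risk scores
print("C-index:", concordance_index(times, -risk, events))

# Statistical parity for fairness: difference in positive prediction
# rates between two subgroups (here an illustrative sex attribute).
group = np.array(["F", "M", "F", "M", "F", "M"])
spd = y_pred[group == "F"].mean() - y_pred[group == "M"].mean()
print("Statistical parity difference:", spd)
```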
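For the multi-site evaluation under universality 3, one simple sketch is to report performance separately for each evaluation site rather than pooling all sites into a single number. The site labels and predictions below are illustrative placeholders; only pandas and scikit-learn are assumed.

```python
# Sketch: per-site performance reporting for a multi-site evaluation
# (universality 3). Site labels and predictions are toy values.
import pandas as pd
from sklearn.metrics import f1_score

results = pd.DataFrame({
    "site":   ["A", "A", "A", "B", "B", "C", "C", "C"],
    "y_true": [1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred": [1, 0, 0, 1, 0, 0, 1, 1],
})

# One F1 score per site makes site-to-site variation visible instead of
# hiding it in a pooled estimate.
for site, group in results.groupby("site"):
    print(site, f1_score(group["y_true"], group["y_pred"]))
```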
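The fairness criterion in the table (fairness 3) states that a statistical parity difference between −0.1 and 0.1 is considered fair. A minimal check of that criterion across several sensitive attributes could look as follows; the DataFrame columns and values are illustrative assumptions.

```python
# Sketch: fairness evaluation against the criterion from the table
# (statistical parity difference within [-0.1, 0.1] considered fair).
import pandas as pd

df = pd.DataFrame({
    "prediction": [1, 0, 1, 1, 0, 1, 0, 0],
    "sex":        ["F", "F", "M", "M", "F", "M", "F", "M"],
    "age_group":  ["<65", ">=65", "<65", ">=65", "<65", ">=65", "<65", ">=65"],
})

THRESHOLD = 0.1  # fairness criterion from the evaluation plan

for attribute in ["sex", "age_group"]:
    rates = df.groupby(attribute)["prediction"].mean()
    spd = rates.max() - rates.min()   # gap in positive prediction rates
    verdict = "fair" if spd <= THRESHOLD else "biased"
    print(f"{attribute}: statistical parity difference = {spd:.2f} -> {verdict}")
```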
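The "training data resampling" mitigation named under fairness 3 can be sketched as oversampling the under-represented subgroup before retraining; equalised odds post-processing would typically come from a dedicated fairness toolkit and is not shown here. The columns and sampling target below are illustrative assumptions.

```python
# Sketch: bias mitigation by resampling the training data so that each
# subgroup (here: sex) is equally represented. Columns are illustrative.
import pandas as pd

train = pd.DataFrame({
    "feature": [0.2, 0.5, 0.8, 0.1, 0.9, 0.4, 0.7],
    "label":   [0, 1, 1, 0, 1, 0, 1],
    "sex":     ["F", "M", "M", "F", "M", "M", "M"],
})

target = train["sex"].value_counts().max()   # size of the largest subgroup

balanced = pd.concat(
    [
        grp.sample(n=target, replace=True, random_state=0)  # oversample
        for _, grp in train.groupby("sex")
    ],
    ignore_index=True,
)

print(balanced["sex"].value_counts())  # equal counts per subgroup
# The model would then be retrained on `balanced`, and both fairness and
# performance re-evaluated to check the trade-off noted in the table.
```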
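The System Usability Scale questionnaire mentioned under usability 4 is scored with a fixed formula (odd items contribute the response minus 1, even items contribute 5 minus the response, and the sum is multiplied by 2.5). A small sketch of that scoring, with made-up participant responses:

```python
# Sketch: scoring the System Usability Scale (SUS) questionnaire
# (usability 4). Responses are 1-5 Likert values for 10 items.
def sus_score(responses):
    """Return the SUS score (0-100) for one participant's 10 responses."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)  # odd vs even item scoring
    return total * 2.5

participants = [
    [4, 2, 4, 1, 5, 2, 4, 2, 5, 1],   # illustrative responses
    [3, 3, 4, 2, 4, 2, 3, 3, 4, 2],
]
scores = [sus_score(p) for p in participants]
print("Mean SUS score:", sum(scores) / len(scores))
```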
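For robustness under simulated variations (robustness 3), one way to sketch the "image blurring" example is to perturb the test images and measure how often the tool's predictions change. The `model_predict` function and random images below are placeholders standing in for the actual AI tool and test set.

```python
# Sketch: robustness test under a simulated variation (image blurring).
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))   # stand-in for a test set of images

def model_predict(batch):
    # Placeholder for the AI tool under evaluation.
    return (batch.mean(axis=(1, 2)) > 0.5).astype(int)

clean_preds = model_predict(images)
blurred = np.stack([gaussian_filter(img, sigma=2.0) for img in images])
blurred_preds = model_predict(blurred)

# Prediction agreement between clean and perturbed inputs as a simple
# robustness indicator.
agreement = (clean_preds == blurred_preds).mean()
print(f"Agreement under blurring: {agreement:.0%}")
```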
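The stability of explanations under input perturbations (explainability 2) can be sketched by generating an explanation for an input and for a slightly perturbed copy, then correlating the two maps. The `explain` function here is only a placeholder for whatever saliency-style method the tool actually uses (eg, Grad-CAM or SHAP).

```python
# Sketch: stability of explanations under input perturbations
# (explainability 2). `explain` is an illustrative placeholder.
import numpy as np

rng = np.random.default_rng(1)

def explain(image):
    # Placeholder explanation returning a saliency map of the same shape
    # as the input; a real evaluation would call the tool's own method.
    return np.abs(image - image.mean())

image = rng.random((64, 64))
noisy = image + rng.normal(scale=0.05, size=image.shape)

sal_clean = explain(image).ravel()
sal_noisy = explain(noisy).ravel()

# High correlation suggests explanations are stable under small perturbations.
stability = np.corrcoef(sal_clean, sal_noisy)[0, 1]
print(f"Explanation stability (correlation): {stability:.2f}")
```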
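Finally, the technical documentation under traceability 2 (model cards, AI passport) is often easier to keep up to date when stored in a machine-readable form. The sketch below writes a minimal model card as JSON; the field names loosely follow common model card practice and every value is a placeholder, not a real result or a fixed schema.

```python
# Sketch: a minimal machine-readable model card (traceability 2).
# All field names and values are illustrative placeholders.
import json

model_card = {
    "model_name": "example-risk-model",
    "version": "0.1.0",
    "intended_use": "Decision support; not a replacement for clinical judgement",
    "training_data": {"source": "internal retrospective cohort", "n": 12000},
    "evaluation": {
        "external_sites": 3,
        "metrics": {"f1": 0.81, "c_index": 0.74},
        "fairness": {"statistical_parity_difference_sex": 0.06},
    },
    "known_limitations": ["not validated for patients under 18"],
}

with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```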