Universality

The Universality principle states that a medical AI tool should be generalisable outside the controlled environment where it was built. Specifically, the AI tool should be able to generalise to new patients and users and, when applicable, to new clinical sites. Depending on the intended radius of application, medical AI tools should be as interoperable and as transferable as possible, so that they can benefit citizens and clinicians at scale.

The FUTURE-AI framework defines four key recommendations for Universality, with varying levels of compliance required for research tools and deployable tools. Defining the intended clinical settings and cross-setting variations is highly recommended (++) for both research and deployable tools: teams should specify the healthcare settings and the resources they require, and identify potential variations across environments. The use of community-defined standards, such as clinical definitions and technical standards, is recommended (+) for both research and deployable applications, promoting interoperability and standardisation.

External evaluation using multiple datasets and/or sites is highly recommended (++) for both research and deployable tools, to ensure broad generalisability. The framework places particular emphasis on evaluating and demonstrating local clinical validity, with a higher requirement for deployable tools (++) than for research applications (+), so that AI tools perform effectively within specific local contexts and workflows.

This graduated approach to compliance reflects the framework’s recognition that, while all aspects of universality matter, certain elements become critical when moving from research to deployment, particularly ensuring local clinical validity and real-world effectiveness.

| Recommendations | Operations | Examples |
|---|---|---|
| Define intended clinical settings and cross-setting variations (universality 1) | Define the AI tool’s healthcare setting(s) | Primary care, hospital, remote care facility, home care |
| | Define the resources needed at each setting | Personnel (experience, digital literacy), medical equipment (eg, >1.5 T MRI scanner), IT infrastructure |
| | Specify if the AI tool is intended for high-end and/or low-resource settings | Facilities with MRI scanners >1.5 T v low field MRIs (eg, 0.5 T), high-end v low-cost portable ultrasound |
| | Identify all cross-setting variations | Data formats, medical equipment, data protocols, IT infrastructure |
| Use community-defined standards (universality 2) | Use a standard definition for the clinical task | Definition of heart failure by the American Academy of Cardiology |
| | Use a standard method for data labelling | BI-RADS for breast imaging |
| | Use a standard ontology for the AI inputs | DICOM for imaging data, SNOMED for clinical data (see the metadata check sketched after this table) |
| | Adopt technical standards | IEEE 2801-2022 for medical software |
| | Use standard evaluation criteria | See Maier-Hein et al for medical imaging applications, Barocas et al and Bellamy et al for fairness evaluation |
| Evaluate using external datasets and multiple sites (universality 3) | Identify relevant public datasets | Cancer Imaging Archive, UK Biobank, M&Ms, MAMA-MIA, BRATS |
| | Identify external private datasets | New prospective dataset from same site or from different clinical centre |
| | Select multiple evaluation sites | Three sites in same country, five sites in two different countries |
| | Verify that evaluation data and sites reflect real-world variations | Variations in demographics, clinicians, equipment |
| | Confirm that no evaluation data were used during training | Yes/no (see the leakage check sketched after this table) |
| Evaluate and demonstrate local clinical validity (universality 4) | Test AI model using local data | Data from local clinical registry |
| | Identify factors that could affect AI tool’s local validity | Local operators, equipment, clinical workflows, acquisition protocols |
| | Assess AI tool’s integration within local clinical workflows | AI tool’s interface aligns with hospital IT system or disrupts routine practice |
| | Assess AI tool’s local practical utility and identify any operational challenges | Time to operate, clinician satisfaction, disruption of existing operations |
| | Implement adjustments for local validity | Model calibration, fine-tuning, transfer learning (see the recalibration sketch after this table) |
| | Compare performance of AI tool with that of local clinicians | Side-by-side comparison, in silico trial |
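As an illustration of universality 1 and 2 together, the sketch below reads locally acquired imaging inputs through the DICOM standard and checks them against an intended acquisition setting (an MRI scanner of at least 1.5 T). It is a minimal sketch only: the use of the pydicom library, the folder layout, and the modality and field-strength thresholds are illustrative choices, not requirements of the FUTURE-AI framework.

```python
# Check that local imaging inputs match the intended acquisition setting
# (universality 1) and are read through the DICOM standard (universality 2).
# The folder name, modality, and 1.5 T threshold below are illustrative.
from pathlib import Path

import pydicom

MIN_FIELD_STRENGTH_T = 1.5   # intended setting: MRI scanners of at least 1.5 T
EXPECTED_MODALITY = "MR"

def check_acquisition(dicom_path: Path) -> list[str]:
    """Return a list of deviations from the intended acquisition setting."""
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    issues = []
    if getattr(ds, "Modality", None) != EXPECTED_MODALITY:
        issues.append(f"unexpected modality: {getattr(ds, 'Modality', 'missing')}")
    field_strength = getattr(ds, "MagneticFieldStrength", None)
    if field_strength is None or float(field_strength) < MIN_FIELD_STRENGTH_T:
        issues.append(f"field strength below {MIN_FIELD_STRENGTH_T} T: {field_strength}")
    return issues

# Hypothetical folder holding the local DICOM studies to be screened.
for path in sorted(Path("local_studies").rglob("*.dcm")):
    problems = check_acquisition(path)
    if problems:
        print(path, "->", "; ".join(problems))
```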
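For universality 3, the following sketch reports discrimination per evaluation site rather than pooled, and confirms that no evaluation patients were seen during training. The CSV files and column names (patient_id, site, y_true, y_score) are hypothetical; they stand in for whatever format a team uses to store its evaluation results.

```python
# Per-site external evaluation and a basic leakage check (universality 3).
import pandas as pd
from sklearn.metrics import roc_auc_score

train = pd.read_csv("training_cohort.csv")            # hypothetical file
evaluation = pd.read_csv("external_evaluation.csv")   # hypothetical file

# Confirm that no evaluation patients were seen during training.
overlap = set(train["patient_id"]) & set(evaluation["patient_id"])
assert not overlap, f"evaluation patients leaked into training: {sorted(overlap)[:5]}"

# Report discrimination per evaluation site so that site-to-site variation
# (demographics, clinicians, equipment, protocols) stays visible.
for site, group in evaluation.groupby("site"):
    auc = roc_auc_score(group["y_true"], group["y_score"])
    print(f"{site}: n={len(group)}, AUC={auc:.3f}")
```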
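For universality 4, one common local adjustment is to recalibrate the model’s output probabilities on local data while leaving the underlying model untouched. The sketch below shows Platt-style recalibration with scikit-learn; the placeholder arrays stand in for scores and outcomes drawn from a local clinical registry, and in practice the recalibrator would be fitted on one local sample and assessed on another.

```python
# Recalibrating a frozen model's probability outputs on local data (universality 4).
# Minimal sketch: the data below are placeholders, not real local registry data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def to_logit(p, eps=1e-6):
    """Convert probability scores to log-odds for the recalibration model."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

# p_local: probabilities from the existing model on local cases;
# y_local: the corresponding local outcome labels (placeholders here).
rng = np.random.default_rng(0)
p_local = rng.uniform(0.05, 0.95, size=200)
y_local = rng.binomial(1, np.clip(p_local * 0.7, 0, 1))

# Fit a logistic mapping from the model's log-odds to local outcomes.
recalibrator = LogisticRegression()
recalibrator.fit(to_logit(p_local).reshape(-1, 1), y_local)
p_recalibrated = recalibrator.predict_proba(to_logit(p_local).reshape(-1, 1))[:, 1]

# In practice, fit the recalibrator on one local sample and assess on another.
print("Brier before:", brier_score_loss(y_local, p_local))
print("Brier after: ", brier_score_loss(y_local, p_recalibrated))
```

When recalibration alone does not restore local performance, heavier adjustments such as fine-tuning or transfer learning on local data, as listed in the table, may be needed.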