Universality

The Universality principle states that a medical AI tool should be generalisable outside the controlled environment where it was built. Specifically, the AI tool should be able to generalise to new patients and users and, when applicable, to new clinical sites. Depending on the intended radius of application, medical AI tools should be as interoperable and as transferable as possible, so that they can benefit citizens and clinicians at scale.

The FUTURE-AI framework defines four key recommendations for Universality, with varying levels of compliance required for research tools and deployable tools. Defining the intended clinical settings and cross-setting variations is highly recommended (++) for both research and deployable tools: teams should specify the healthcare settings and the resources they require, and identify potential variations across environments. The use of community-defined standards, such as clinical definitions and technical standards, is recommended (+) for both research and deployable applications, promoting interoperability and standardisation.

External evaluation using multiple datasets and/or sites is highly recommended (++) for both research and deployable tools, to ensure broad generalisability. The framework places particular emphasis on evaluating and demonstrating local clinical validity, with a higher requirement for deployable tools (++) than for research applications (+), so that AI tools perform effectively within specific local contexts and workflows.

This graduated approach to compliance reflects the framework’s recognition that, while all aspects of universality matter, certain elements become critical when moving from research to deployment, particularly ensuring local clinical validity and real-world effectiveness.

| Recommendations | Operations | Examples |
|---|---|---|
| Define intended clinical settings and cross-setting variations (universality 1) | Define the AI tool’s healthcare setting(s) | Primary care, hospital, remote care facility, home care |
| | Define the resources needed at each setting | Personnel (experience, digital literacy), medical equipment (eg, >1.5 T MRI scanner), IT infrastructure |
| | Specify if the AI tool is intended for high-end and/or low-resource settings | Facilities with MRI scanners >1.5 T v low field MRIs (eg, 0.5 T), high-end v low-cost portable ultrasound |
| | Identify all cross-setting variations | Data formats, medical equipment, data protocols, IT infrastructure |
| Use community-defined standards (universality 2) | Use a standard definition for the clinical task | Definition of heart failure by the American Academy of Cardiology |
| | Use a standard method for data labelling | BI-RADS for breast imaging |
| | Use a standard ontology for the AI inputs | DICOM for imaging data, SNOMED for clinical data (see the metadata check sketched after this table) |
| | Adopt technical standards | IEEE 2801-2022 for medical software |
| | Use standard evaluation criteria | See Maier-Hein et al for medical imaging applications, Barocas et al and Bellamy et al for fairness evaluation |
| Evaluate using external datasets and multiple sites (universality 3) | Identify relevant public datasets | Cancer Imaging Archive, UK Biobank, M&Ms, MAMA-MIA, BRATS |
| | Identify external private datasets | New prospective dataset from same site or from different clinical centre |
| | Select multiple evaluation sites | Three sites in same country, five sites in two different countries |
| | Verify that evaluation data and sites reflect real-world variations | Variations in demographics, clinicians, equipment |
| | Confirm that no evaluation data were used during training | Yes/no (see the leakage check sketched after this table) |
| Evaluate and demonstrate local clinical validity (universality 4) | Test AI model using local data | Data from local clinical registry |
| | Identify factors that could affect AI tool’s local validity | Local operators, equipment, clinical workflows, acquisition protocols |
| | Assess AI tool’s integration within local clinical workflows | AI tool’s interface aligns with hospital IT system or disrupts routine practice |
| | Assess AI tool’s local practical utility and identify any operational challenges | Time to operate, clinician satisfaction, disruption of existing operations |
| | Implement adjustments for local validity | Model calibration, fine-tuning, transfer learning (see the recalibration sketch after this table) |
| | Compare performance of AI tool with that of local clinicians | Side-by-side comparison, in silico trial |
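As an illustration of universality 1 and 2 together, the sketch below reads locally acquired imaging inputs through the DICOM standard and checks them against an intended acquisition setting (an MRI scanner of at least 1.5 T). It is a minimal sketch only: the use of the pydicom library, the folder layout, and the modality and field-strength thresholds are illustrative choices, not requirements of the FUTURE-AI framework.

```python
# Check that local imaging inputs match the intended acquisition setting
# (universality 1) and are read through the DICOM standard (universality 2).
# The folder name, modality, and 1.5 T threshold below are illustrative.
from pathlib import Path

import pydicom

MIN_FIELD_STRENGTH_T = 1.5   # intended setting: MRI scanners of at least 1.5 T
EXPECTED_MODALITY = "MR"

def check_acquisition(dicom_path: Path) -> list[str]:
    """Return a list of deviations from the intended acquisition setting."""
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    issues = []
    if getattr(ds, "Modality", None) != EXPECTED_MODALITY:
        issues.append(f"unexpected modality: {getattr(ds, 'Modality', 'missing')}")
    field_strength = getattr(ds, "MagneticFieldStrength", None)
    if field_strength is None or float(field_strength) < MIN_FIELD_STRENGTH_T:
        issues.append(f"field strength below {MIN_FIELD_STRENGTH_T} T: {field_strength}")
    return issues

# Hypothetical folder holding the local DICOM studies to be screened.
for path in sorted(Path("local_studies").rglob("*.dcm")):
    problems = check_acquisition(path)
    if problems:
        print(path, "->", "; ".join(problems))
```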
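For universality 3, the following sketch reports discrimination per evaluation site rather than pooled, and confirms that no evaluation patients were seen during training. The CSV files and column names (patient_id, site, y_true, y_score) are hypothetical; they stand in for whatever format a team uses to store its evaluation results.

```python
# Per-site external evaluation and a basic leakage check (universality 3).
import pandas as pd
from sklearn.metrics import roc_auc_score

train = pd.read_csv("training_cohort.csv")            # hypothetical file
evaluation = pd.read_csv("external_evaluation.csv")   # hypothetical file

# Confirm that no evaluation patients were seen during training.
overlap = set(train["patient_id"]) & set(evaluation["patient_id"])
assert not overlap, f"evaluation patients leaked into training: {sorted(overlap)[:5]}"

# Report discrimination per evaluation site so that site-to-site variation
# (demographics, clinicians, equipment, protocols) stays visible.
for site, group in evaluation.groupby("site"):
    auc = roc_auc_score(group["y_true"], group["y_score"])
    print(f"{site}: n={len(group)}, AUC={auc:.3f}")
```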
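For universality 4, one common local adjustment is to recalibrate the model’s output probabilities on local data while leaving the underlying model untouched. The sketch below shows Platt-style recalibration with scikit-learn; the placeholder arrays stand in for scores and outcomes drawn from a local clinical registry, and in practice the recalibrator would be fitted on one local sample and assessed on another.

```python
# Recalibrating a frozen model's probability outputs on local data (universality 4).
# Minimal sketch: the data below are placeholders, not real local registry data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def to_logit(p, eps=1e-6):
    """Convert probability scores to log-odds for the recalibration model."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

# p_local: probabilities from the existing model on local cases;
# y_local: the corresponding local outcome labels (placeholders here).
rng = np.random.default_rng(0)
p_local = rng.uniform(0.05, 0.95, size=200)
y_local = rng.binomial(1, np.clip(p_local * 0.7, 0, 1))

# Fit a logistic mapping from the model's log-odds to local outcomes.
recalibrator = LogisticRegression()
recalibrator.fit(to_logit(p_local).reshape(-1, 1), y_local)
p_recalibrated = recalibrator.predict_proba(to_logit(p_local).reshape(-1, 1))[:, 1]

# In practice, fit the recalibrator on one local sample and assess on another.
print("Brier before:", brier_score_loss(y_local, p_local))
print("Brier after: ", brier_score_loss(y_local, p_recalibrated))
```

When recalibration alone does not restore local performance, heavier adjustments such as fine-tuning or transfer learning on local data, as listed in the table, may be needed.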