The Universality principle states that a medical AI tool should be generalisable outside the controlled environment where it was built. Specifically, the AI tool should be able to generalise to new patients and users, and when applicable, to new clinical sites. Depending on the intended radius of application, medical AI tools should be as interoperable and as transferable as possible, so they can benefit citizens and clinicians at scale.
The FUTURE-AI framework defines four key recommendations for Universality, with varying levels of compliance requirements for research and deployable tools. Defining intended clinical settings and cross-setting variations is highly recommended (++) for both research and deployable tools, requiring teams to specify healthcare settings and resource needs and to identify potential variations across environments. The use of community-defined standards, such as clinical definitions and technical standards, is recommended (+) for both research and deployable applications, promoting interoperability and standardisation.
External evaluation using multiple datasets and/or sites is highly recommended (++) for both research and deployable tools, ensuring broad generalisability. The framework places particular emphasis on evaluating and demonstrating local clinical validity, with a higher requirement for deployable tools (++) than for research applications (+), ensuring that AI tools perform effectively within specific local contexts and workflows.
This graduated approach to compliance requirements reflects the framework’s recognition that, while all aspects of universality are important, certain elements become critical when moving from research to deployment, particularly ensuring local clinical validity and real-world effectiveness.
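To make external evaluation (universality 3) concrete before the detailed operations in the table below, here is a minimal sketch of a leave-one-site-out analysis, a common internal proxy for multi-site evaluation when data from several sites are already pooled. The file name, column names, and the logistic regression baseline are illustrative placeholders, not part of the FUTURE-AI framework itself.

```python
# Minimal leave-one-site-out evaluation sketch, assuming a pooled tabular
# cohort with a "site" column identifying the clinical centre of origin.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical dataset: features, a binary label, and the site of origin.
df = pd.read_csv("multi_site_cohort.csv")          # assumed file
X = df.drop(columns=["label", "site"]).to_numpy()
y = df["label"].to_numpy()
sites = df["site"].to_numpy()

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=sites):
    # Train on all sites except one, then test on the held-out site.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    held_out_site = sites[test_idx][0]
    auc = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])
    print(f"Held-out site {held_out_site}: AUC = {auc:.3f}")
```

Such an internal check complements, but does not replace, evaluation on genuinely external datasets and prospective sites as described below.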
| Recommendations | Operations | Examples |
| --- | --- | --- |
| Define intended clinical settings and cross-setting variations (universality 1) | Define the AI tool’s healthcare setting(s) | Primary care, hospital, remote care facility, home care |
| | Define the resources needed at each setting | Personnel (experience, digital literacy), medical equipment (eg, >1.5 T MRI scanner), IT infrastructure |
| | Specify if the AI tool is intended for high-end and/or low-resource settings | Facilities with MRI scanners >1.5 T v low field MRIs (eg, 0.5 T), high-end v low-cost portable ultrasound |
| | Identify all cross-setting variations | Data formats, medical equipment, data protocols, IT infrastructure |
| Use community-defined standards (universality 2) | Use a standard definition for the clinical task | Definition of heart failure by the American College of Cardiology |
| | Use a standard method for data labelling | BI-RADS for breast imaging |
| | Use a standard ontology for the AI inputs | DICOM for imaging data, SNOMED for clinical data (see the metadata audit sketch after this table) |
| | Adopt technical standards | IEEE 2801-2022 for medical software |
| | Use standard evaluation criteria | See Maier-Hein et al for medical imaging applications, Barocas et al and Bellamy et al for fairness evaluation |
| Evaluate using external datasets and multiple sites (universality 3) | Identify relevant public datasets | Cancer Imaging Archive, UK Biobank, M&Ms, MAMA-MIA, BRATS |
| | Identify external private datasets | New prospective dataset from the same site or from a different clinical centre |
| | Select multiple evaluation sites | Three sites in the same country, five sites in two different countries |
| | Verify that evaluation data and sites reflect real-world variations | Variations in demographics, clinicians, equipment |
| | Confirm that no evaluation data were used during training | Yes/no (see the overlap check sketch after this table) |
| Evaluate and demonstrate local clinical validity (universality 4) | Test the AI model using local data | Data from a local clinical registry |
| | Identify factors that could affect the AI tool’s local validity | Local operators, equipment, clinical workflows, acquisition protocols |
| | Assess the AI tool’s integration within local clinical workflows | AI tool’s interface aligns with the hospital IT system or disrupts routine practice |
| | Assess the AI tool’s local practical utility and identify any operational challenges | Time to operate, clinician satisfaction, disruption of existing operations |
| | Implement adjustments for local validity | Model calibration, fine-tuning, transfer learning (see the recalibration sketch after this table) |
| | Compare performance of the AI tool with that of local clinicians | Side-by-side comparison, in silico trial (see the paired comparison sketch after this table) |
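For the cross-setting variation and standard ontology operations (universality 1 and 2), a lightweight first step is to audit standard DICOM metadata across sites, which reveals differences in modality, vendor, and field strength before any modelling. The sketch below assumes per-site folders of DICOM files readable with pydicom; the directory names and the chosen tags are illustrative.

```python
# Minimal sketch of a cross-setting metadata audit using standard DICOM tags.
from pathlib import Path

import pydicom


def summarise_site(dicom_dir):
    """Collect (modality, manufacturer, field strength) combinations at one site."""
    combos = set()
    for path in Path(dicom_dir).rglob("*.dcm"):
        # Read headers only; pixel data are not needed for a metadata audit.
        ds = pydicom.dcmread(path, stop_before_pixels=True)
        combos.add((
            ds.get("Modality", "unknown"),
            ds.get("Manufacturer", "unknown"),
            ds.get("MagneticFieldStrength", "n/a"),
        ))
    return combos


for site_dir in ["site_a_dicom", "site_b_dicom"]:   # assumed directories
    print(site_dir, summarise_site(site_dir))
```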
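Confirming that no evaluation data were used during training (universality 3) can be operationalised as a patient-level overlap check between the two cohorts. The sketch below assumes each cohort exposes a stable patient identifier; the manifest file and column names are illustrative, and the hashing step is simply one way to compare identifiers without sharing raw IDs.

```python
# Minimal sketch of a train/evaluation overlap (data leakage) check.
import hashlib

import pandas as pd


def pseudonymise(patient_id, salt="project-salt"):
    """Hash identifiers so cohorts can be compared without exposing raw IDs."""
    return hashlib.sha256((salt + str(patient_id)).encode()).hexdigest()


train = pd.read_csv("training_manifest.csv")          # assumed manifests
evaluation = pd.read_csv("external_eval_manifest.csv")

train_ids = {pseudonymise(pid) for pid in train["patient_id"]}
eval_ids = {pseudonymise(pid) for pid in evaluation["patient_id"]}

overlap = train_ids & eval_ids
if overlap:
    raise ValueError(f"{len(overlap)} evaluation patients also appear in the training data")
print("No patient-level overlap between training and evaluation cohorts")
```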
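Adjusting a tool for local validity (universality 4) often starts with recalibration rather than retraining. The sketch below illustrates one simple option, Platt scaling: a small logistic model maps the frozen model's raw scores to locally calibrated probabilities using local labels. The arrays stand in for local registry data and are purely illustrative; in practice the mapping would be fitted on a dedicated local calibration split and assessed on held-out local data.

```python
# Minimal sketch of local recalibration via Platt scaling; the base model is untouched.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Raw scores from the frozen, externally developed model on local cases,
# plus the local ground-truth labels (eg, from a local clinical registry).
local_scores = np.array([0.12, 0.80, 0.65, 0.30, 0.91, 0.22, 0.55, 0.70])
local_labels = np.array([0, 1, 1, 0, 1, 0, 0, 1])

# Fit the recalibration mapping on local data only.
platt = LogisticRegression()
platt.fit(local_scores.reshape(-1, 1), local_labels)
recalibrated = platt.predict_proba(local_scores.reshape(-1, 1))[:, 1]

print("Brier score before:", brier_score_loss(local_labels, local_scores))
print("Brier score after: ", brier_score_loss(local_labels, recalibrated))
```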
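A side-by-side comparison with local clinicians (universality 4) is ultimately a study design question, but when the AI tool and clinicians assess the same cases, a paired statistical test is one way to summarise the comparison. The sketch below uses McNemar's test on paired correct/incorrect decisions; the arrays are illustrative, and McNemar's test is one possible choice rather than a FUTURE-AI prescription.

```python
# Minimal sketch of a paired AI-versus-clinician comparison on the same cases.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

ground_truth = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
ai_decision = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])
clinician_decision = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

ai_correct = ai_decision == ground_truth
clinician_correct = clinician_decision == ground_truth

# 2x2 table of paired outcomes: rows = AI correct/incorrect, columns = clinician.
table = [
    [np.sum(ai_correct & clinician_correct), np.sum(ai_correct & ~clinician_correct)],
    [np.sum(~ai_correct & clinician_correct), np.sum(~ai_correct & ~clinician_correct)],
]
result = mcnemar(table, exact=True)
print(f"McNemar p value: {result.pvalue:.3f}")
```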