We translated the FUTURE-AI recommendations into an assessment checklist of concrete, actionable questions that will support developers, evaluators and other stakeholders in delivering medical AI tools that are trustworthy and optimised for real-world practice. Each element of the FUTURE-AI checklist comes with examples that illustrate potential mitigation measures for minimising the risks of medical AI algorithms, based on the FUTURE-AI guiding principles. More examples of appropriate and concrete measures for producing trustworthy AI solutions in medicine and healthcare are described in detail in the FUTURE-AI paper. While not every question of the FUTURE-AI checklist applies to every application and scenario, we encourage AI teams to build on as many elements as possible to give their medical AI algorithms solid foundations for trustworthiness and adoption.
|1||Multi-disciplinarity||Did you design your AI algorithm with a diverse team of stakeholders? Did you collect requirements from a diverse set of end-users?||1,2||Yes, I designed my algorithm with radiologists, oncologists, hospital administrators, and ethicists. Yes, I gathered the requirements from 5 male and 5 female patients, as well as from 5 early-career and 5 experienced radiologists.|
|2||Definition of fairness||Did you define fairness for your specific clinical application? Did you ask clinicians about hidden sources of data imbalance?||1,2,4||Yes, I identified sex and socio-economic status as potential sources of bias in my AI application. Furthermore, the clinicians informed me of a potential data imbalance related to breast density (images with high breast density are under-represented).|
|3||Metadata labelling||At data collection, did you record key metadata variables on individuals and groups?||4||Yes, I recorded four variables: sex, age, locality and comorbidities.|
|4||Estimation of data (im)balance||Did you inspect and ensure the diversity of the training and evaluation data?||5, 6||Yes, I estimated the data distribution per sex, ethnicity and country. I used over-sampling to correct for data imbalance with respect to ethnicity (fewer non-White patients in the imaging sample). I talked to radiologists and used the AI Fairness 360 software to identify hidden sources of data bias.|
|5||Multi-centre datasets||Did you train and assess your algorithm on multi-centre data samples? Does your algorithm maintain accuracy across radiology units and geographical locations? In particular, is it applicable in centres with reduced data quality (e.g. in resource-limited countries)?||4,5,6||Yes, I tested my Covid-19 detection algorithm in four centres in Spain, Germany, Turkey and Ghana. I found that transfer learning helps the algorithm maintain its accuracy when tested on low-cost imaging samples.|
|6||Fairness evaluation and metrics||Did you thoroughly evaluate the fairness of your AI algorithm? Did you use a suitable dataset and dedicated metrics? Do you have a mechanism for continuous evaluation of your algorithm’s fairness?||6||Yes, during evaluation I used a multi-centre dataset with both male and female subjects from three ethnicities (White, Black and Chinese subgroups). I estimated and reported True Positive Rates (TPR), Statistical Parity and Group Fairness. I built a mechanism for periodic evaluation of fairness with a reference dataset (every six months).|
|7||Fairness corrective measures||If you identified sources of bias in the data, did you implement mitigation measures?||5,6||Yes, I implemented and tested pre-processing (data re-sampling) as well as post-processing (equalised odds) techniques. I also implemented transfer learning and domain adaptation to improve generalisability to low- and middle-income countries.|
|8||Continuous monitoring of fairness||Do you have a mechanism for continuous testing of your algorithm’s fairness over its lifetime?||6||Yes, I prepared a reference dataset with a diversity of patients and images, and defined an automated periodic test (every six months) as part of the algorithm’s monitoring tool.|
|9||Information and training on fairness||Did you prepare information and training material for the radiologists and clinicians, to inform on potential biases and maximise fairness during the algorithm’s use?||6||Yes.|
|10||Definition of clinical task||Did you use a universal definition of the clinical task?||1,2||Yes, I used the clinical definition of heart failure by the European Society of Cardiology.|
|11||Software standardisation||Did you design and implement the medical AI solution using proven libraries and framework standards that readily allow for extension and maintenance?||2,3,5||Yes, I designed and implemented a modular, object-oriented and readily extendable imaging AI solution using the Python programming language and well-established AI libraries such as PyTorch and TensorFlow.|
|12||Image annotation standardisation||Did you annotate your dataset in an objective, reproducible and standardised way?||3,4||Yes, after discussing the clinical requirements with the end-users, I chose organ delineations as the annotation type, rather than circles or bounding boxes, to enable the training of imaging AI segmentation models. The images were then independently annotated by three radiologists using the 3D Slicer annotation software.|
|13||Variation of quantified biomarkers||Are the methods you used for feature quantification compliant with the consensus provided by standards initiatives?||3,4||Yes, I extracted IBSI-compliant radiomics features using the pyradiomics library.|
|14||Evaluation metric selection and reporting||Did you use universal, transparent, comparable, and reproducible criteria and metrics for your model’s performance assessment?||6||Yes, I reported the Dice Similarity Coefficient, the Hausdorff distance, the Jaccard index, and the accuracy of my segmentation model and each of its variations/ablations in my experiments. The metrics are reported for each segmented anatomy/pathology of interest, but also in an accumulated manner across all anatomies/pathologies.|
|15||Reference dataset evaluation||Did you evaluate your model on at least one open access benchmark dataset that is representative of your model’s task and expected real-world data exposure after deployment?||6||Yes, I evaluated my brain MRI (3D, T1 scans) segmentation model on the publicly available ADNI and BraTS datasets and reported the computed evaluation metrics for each of these datasets.|
|16||Reporting standards compliance||Did you adhere to a standardised reporting guideline when assessing and communicating the design and findings of your study?||6||Yes, I thoroughly followed the TRIPOD-AI framework to report my findings and study design.|
|17||Model scope||Did you agree with the clinicians/radiologists on a precise definition of the model’s scope? Did you precisely define the model’s intended use, the input data modalities, the necessary steps to provide the input to the AI model, the reference ground truth if any, the intended output and the use case scenarios? Did you check for any known limitations of the diagnostic/prognostic problem faced?||1,2,3||Yes, together with clinicians, I selected the most appropriate diagnostic modality to be used as input; we agreed upon the diagnostic question (e.g., estimating the tumour aggressiveness from mp-MRI data). We set the requirements for preparing the input (e.g., segmenting or pinpointing the lesion). We identified limitations and dependencies on the data acquisition parameters (e.g., the b-value of the DWI, the CT reconstruction kernel, or the segmentation accuracy). We noted all these pieces of information in a structured report.|
|18||Data provenance||Did you prepare complete documentation of the dataset you used? Did you include the relevant DICOM tags? Did you structurally list the related clinical/genomic/pathology data?||4||Yes, I adopted and extended DataSheet as the standard for documenting data provenance and ownership, acquisition protocols, devices, and timing.|
|19||Data localization and distribution||Did you annotate the location of data over the network? Did you analyse dataset statistics with respect to the capability to represent the phenomenon at hand over the various clinical sites? Did you quantify missing values and any gaps or known biases?||4,5||Yes, I use a structured identifier to keep track of the clinical site each dataset belongs to and its location over the network. I analysed the data distribution over the various sites, thus identifying the diversity of populations (e.g., finding out that there is a high number of more aggressive tumours at clinical site X).|
|20||Data-preparation documentation||Did you keep track in a structured manner of the whole data pre-processing pipeline? Did you specify input/output, nature, prerequisites and requirements of your pre-processing and data preparation methods?||4,5||Yes, for instance, I included a detailed description of the cropping method used to extract a specific area around the region of interest.|
|21||Specification of clinical references||Did you include a clear description of the radiological/clinical standards or biomarkers used as reference? Did you include a complete record of the segmentation process, if any?||3,4,5||Yes, I explicitly specified the use of PI-RADS v2.1, also adding a link to its definition (https://www.acr.org/-/media/ACR/Files/RADS/Pi-RADS/PIRADS-V2-1.pdf). I included a description of the segmentation and data labelling by specifying who performed the segmentation, their expertise and experience, and the tools they used. I also carried out a stability check and included the results among the limitations.|
|22||Training recording||Did you record the details of the training process? Did you include a careful description of data and metadata?||5||Yes, I organized the training process using a framework that records all the attempts made. I documented the whole process, annotating the computing infrastructure and the training approach. I noted down the framework version, the libraries used and their versions, and the computing facilities. I stored the initialization values of the weights for each attempt to ensure reproducibility. I chose a random optimization method and recorded all the values tested.|
|23||Validation documentation||Did you document your validation process and the model selection approach agreed with clinicians?||6||Yes, I specified the evaluation metrics and the organization of the nested cross-validation we adopted. I agreed with the clinicians on the model that best fit the initial scope (e.g., we selected the model with higher sensitivity as the model should serve screening purposes).|
|24||Final model details||Did you detail the characteristics of the final model released?||5,6,7||Yes, I described the model’s architecture, its interfaces, and its I/O data structures. I also included a description of its limitations with respect to the low specificity and the known points of failure related to data quality.|
|25||Traceability tool||Did you equip your model with a traceability tool? Did you manage the dynamics of your model?||7||Yes, I developed a monitoring tool able to track the live functioning of the model. The tool records the main statistics of the model, also in light of incoming inputs and feedback from clinicians, and warns about possible deviations or performance degradation. I also provide a precise plan for model update and re-training.|
|26||AI Model passport||Did you prepare a full metadata record of all the pieces of information of your model?||1,2,3,4,5,6,7||Yes, I included all the details of the previous points in the model’s passport. In the passport, I highlighted the guidance on how to install and safely use the AI model.|
|27||Accountability and risk specification||Did you make a risk analysis for your model? Did you prepare a tool to keep track of the usage of your model?||5,6,7||Yes, I evaluated together with clinicians the conditions of potential harm and together we prepared a code of conduct for the use of the AI model. I added a facility to log the usage of the model into the traceability tool.|
|28||User engagement||Did you engage users in the design and development of the AI tool?||1||Yes, during requirements analysis, I organised a co-creation workshop with 10 patients and 10 radiologists.|
|29||Requirements definition||Did you compile end-user requirements?||2||Yes, I compiled a list of 15 functional requirements on how the AI tool would be used in clinical practice.|
|30||User interfaces||Did you design appropriate user interfaces?||3||Yes, I designed a user interface based on the user requirements that allows human-computer interaction, as well as visualisation of the images and segmented lesions.|
|31||Usable explainability||Did you implement any type of explainability that will be usable and actionable by the radiologist?||5||Yes, I implemented heatmaps that were requested by the radiologist to better understand the AI predictions.|
|32||Usability testing||Did you design an appropriate usability study?||6||Yes, I designed a usability study with 12 radiologists: 6 early-career and 6 with at least 5 years of experience.|
|33||In-silico validation||Did you consider an in-silico validation of usability?||6||Yes, I re-used existing retrospective data (n=100) in a prospective fashion which blinds the researcher to the outcomes, just as in a “real” clinical trial, and examined the behaviour of the radiologist (e.g. agreement between the clinician’s decision and AI’s recommendation).|
|34||Usability metrics||Did you define the appropriate usability metrics for evaluation?||6||Yes, I used a usability questionnaire that measures several usability aspects, including time required to perform the task, learnability, efficiency, explainability, user satisfaction and intention-to-use.|
|35||Clinical Integration||Did you evaluate the usability of your tool after integration in the clinical workflows of the clinical sites?||6,7||Yes, I tested the integration of my AI tool in the EHR and PACS systems of my collaborating hospital.|
|36||Training material||Did you provide end-users with resources to learn to adopt and appropriately work with your tool?||6,7||Yes, I designed a user guide and organised a video-recorded workshop with the 15 initial clinical users explaining the functions of my AI tool with practical examples.|
|37||Usability monitoring||Did you implement monitoring mechanisms to assess changes in user needs and re-evaluate the appropriateness of the AI solution over time?||5,6,7||Yes, I have developed a mechanism to record user feedback and assess user needs over time, in order to detect early any changes that would require updating or re-designing certain functionality or aspects of the product.|
|38||Image harmonization||Did you implement any image harmonization solutions to account for image heterogeneity?||4||Yes, (1) I made use of ComBat, an open-access image standardisation tool, and (2) I also developed a novel algorithm based on GANs for image harmonization on my highly heterogeneous breast cancer dataset. I also used (3) histogram normalisation and (4) bias field correction, following the guidance provided in the authors’ reference implementation.|
|39||Feature harmonization||Did you perform any feature harmonization study before developing your predictive models? Did you assess, minimise, and report the variation across features?||4||Yes, in my study I investigated the repeatability and reproducibility of radiomic features with respect to three different scanners, variable slice thickness, tube current, and use of intravenous contrast medium, combining phantom studies and human subjects with non-small cell lung cancer. I reported the respective feature variations to increase reproducibility.|
|40||Intra- and inter-observer variability||Did you perform any intra- and inter-observer annotation studies?||4||Yes, I designed a multi-observer study with five different radiologists with 3-15 years of experience from 3 different centres to evaluate the effect of annotation variability in radiomic models by using the Dice coefficient and the coefficient of variation (CV). This study resulted in (1) the removal of those radiomics features with low reproducibility and (2) the reassessment of lesion annotations with high intra- and inter-observer variability.|
|41||Quality control||Did you use any quality control tools to identify abnormal deviations or artifacts in images?||4||Yes, first I applied the MRQy tool to classify lung cancer images as good or bad quality (based on unsupervised learning algorithms) and extracted a quality score from them; then, I introduced this value into my predictive model as an additional feature.|
|42||Phantoms||Did you use phantoms to harmonise patient images and/or measurements?||4||Yes, I used a standard MR system phantom to assess scanner performance, stability, monitoring over time, comparability and to evaluate the accuracy of quantitative relaxation time imaging, which allowed me to perform more accurate and reliable T1 and T2 measurements on patient images.|
|43||Data augmentation for model training||Did you use data augmentation techniques to improve training of AI models?||4,5||Yes, I used the original prostate cancer images to generate a batch of new data by applying both spatial augmentations (rotations, flipping, scaling) and intensity-based augmentations (histogram matching, Gaussian noise, brightness) to reduce overfitting and account for extreme cases during the model training phase.|
|44||Training on heterogeneous data||Did you train and evaluate your tools with heterogeneous datasets from multiple clinical centres, vendors, and protocols?||5,6||Yes, I trained my model with a lung cancer dataset that contains images from four centres in France, Greece, Brazil and Japan, acquired on different scanners from different manufacturers (Toshiba, GE, Philips and Siemens) and with different acquisition parameters (flip angle, repetition time, field strength), which was accessible through a federated data consortium model.|
|45||Uncertainty estimation||Did you report any kind of model uncertainty beyond the classifier’s discriminant or confidence score?||5,6||Yes, I obtained the predictive uncertainty of my deep learning model for the classification of prostate cancer (by using a deep neural network with a softmax output layer), which allowed me to examine whether the incorrectly classified imaging cases were associated with higher uncertainty than the correctly classified ones.|
|46||Equity in accessibility||Did you optimise your tool with images from resource-limited settings in low- and middle-income countries?||6,7||Yes, I used transfer learning techniques to fine-tune and optimise my automatic segmentation model of breast cancer lesions to new unseen imaging data from a South African medical centre. I validated the final model with an external dataset from an Egyptian hospital.|
|47||Clinical requirements on explainability||Did you consult with the clinicians to determine which explainability methods suit them? Did you intuitively present the different explanation methods to the clinicians and did they develop a clear understanding of them?||||Yes, I presented four possible explanation methods (attribution maps, concept attribution, Shapley values, LIME) to the doctors using dummy examples and they selected the attribution maps as well as concept attribution as candidate methods for explainability in this particular application.|
|48||Incorporation of clinical concepts||Did you consider using clinical annotations and clinical concepts as parameters of the AI algorithms or neural networks to explicitly introduce a level of clinical interpretability?||||Yes, for my AI algorithm that estimates lung lesion malignancy, I annotated and included 6 clinical concepts (lesion subtlety, sphericity, margin, lobulation, spiculation, and texture) as inputs, which facilitated the clinical explainability of the predictions.|
|49||Multiple explanation methods||Did you explore multiple and complementary explainability methods?||||Yes, I used (i) Testing with Concept Activation Vectors (TCAV) and (ii) attribution maps for explaining my lung lesion classification. The first method provides information on common characteristics that are relevant to the classification model, while the second one highlights local areas of the images that are important for each specific case.|
|50||Identifying explainable biomarkers||To increase clinical value, did you evaluate whether the explainability methods make it possible to identify variables or features that can serve as biomarkers? Did you determine if the identified imaging biomarkers are previously known?||||Yes, a qualitative analysis of attribution maps revealed that the model uses the skin outside the lesion for the diagnosis of pigmented actinic keratosis. Yes, this is an already known biomarker.|
|51||Quantitative evaluation of explainability||Did you use quantitative evaluation tests to determine if the explanations are robust and trustworthy?||||Yes, I performed model randomisation tests, data randomisation tests and reproducibility tests, and determined the Area Over the Perturbation Curve (AOPC) for the quantitative evaluation of the attribution maps.|
|52||Qualitative evaluation of explainability||Did you perform some qualitative evaluation tests with clinicians?||||Yes, the System Causability Scale (SCS) was used by my clinical collaborators to rate the explanations.|
|53||Robustness of explainability against adversarial attacks||Did you evaluate robustness to adversarial attacks, by assessing if the explanations remain consistent when the input images are subjected to small input perturbations and noise?||||Yes, I applied small input perturbations to the input image and added noise to generate new input images indistinguishable from the original. I found that the classifications did not change but the attribution maps highlighted very different areas of the tissues than before.|
|Did you evaluate the effect of explainability methods in clinical practice by performing a collaborative human-AI study in which the doctor performs the clinical task using the AI tool with and without explanations? Did you identify any resulting bias from the introduction of the explainability methods?||||Yes, the radiologists used my diagnostic AI tool with and without the explanations. The human-AI collaboration led to better detection of pathological cases when the explanations were also available. However, it also led to increased false positives, as the attribution maps influenced the clinicians: the areas highlighted by the attribution maps were interpreted as diseased tissue.|
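
The fairness metrics named in item 6 (True Positive Rate per subgroup and Statistical Parity) can be illustrated with a minimal sketch. It assumes binary ground-truth labels and binary predictions, plus one recorded metadata value (here sex) per patient. The function names `group_rates` and `max_gap` and the toy data are hypothetical and are not part of the FUTURE-AI checklist or of any specific fairness library.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group true positive rate (TPR) and positive prediction rate.

    y_true, y_pred: sequences of 0/1 labels and predictions.
    groups: sequence with the subgroup of each case (e.g. sex).
    """
    counts = defaultdict(lambda: {"n": 0, "pos": 0, "pred_pos": 0, "tp": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1          # cases in this subgroup
        c["pos"] += t        # cases with the condition
        c["pred_pos"] += p   # cases predicted positive
        c["tp"] += t * p     # correctly detected cases
    rates = {}
    for g, c in counts.items():
        rates[g] = {
            # TPR is undefined for a subgroup without positive cases
            "tpr": c["tp"] / c["pos"] if c["pos"] else None,
            "positive_rate": c["pred_pos"] / c["n"],
        }
    return rates


def max_gap(rates, key):
    """Largest difference of one metric between any two subgroups."""
    values = [r[key] for r in rates.values() if r[key] is not None]
    return max(values) - min(values)


# Toy data (assumption): four female and four male patients.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]

rates = group_rates(y_true, y_pred, sex)
print(max_gap(rates, "positive_rate"))  # statistical parity difference: 0.5
print(max_gap(rates, "tpr"))            # TPR gap between subgroups: 0.5
```

A gap close to zero on both metrics suggests similar behaviour across subgroups. Re-running the same computation on a fixed reference dataset at regular intervals is one way to implement the periodic fairness test described in item 8.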