Vision-language zero-shot models for radiographic image classification: a systematic review

dc.contributor.author: Guerrero Tamayo, Ana
dc.contributor.author: Oleagordia Ruiz, Ibon
dc.contributor.author: García-Zapirain, Begoña
dc.date.accessioned: 2026-03-03T18:47:48Z
dc.date.available: 2026-03-03T18:47:48Z
dc.date.issued: 2026-03
dc.date.updated: 2026-03-03T18:47:48Z
dc.description.abstract: Zero-shot Vision-Language Models (VLMs) link visual and textual features, enabling generalization to unseen domains; this makes them promising for radiographic diagnosis, although clinical adoption remains limited. This systematic review examines zero-shot VLMs applied to radiographic image classification, following the PRISMA methodology. Articles were identified from IEEE, PubMed, Scopus, and Web of Science, with 16 selected after exhaustive screening. The analysis addressed five research questions (RQ1–RQ5) covering dataset characteristics, model attributes, natural language integration, reported limitations, and hyperparameter tuning. Geographically, China (37%) and the United States (38%) contributed 75% of the reviewed studies, with no EU-led research identified, highlighting the need for increased European engagement in this field. Architecturally (RQ2), heterogeneity is high, with dual-encoder (43.75%) and attention-based fusion models the most common; most models (81.25%) employ a joint embedding space for multimodal alignment. Regarding datasets and natural language use (RQ1, RQ3), VLMs rely on a few large but semantically narrow datasets, which limits generalizability and amplifies bias. Real clinical reports (direct supervision) and implicit pretrained textual embeddings each account for 37.5% of the strategies, yet unstructured clinical text remains underutilized. Limited vision-language integration negatively affects performance and explainability (RQ4). Hyperparameter tuning (RQ5) is rarely reported, with 9 of 16 studies not specifying their methods, compromising reproducibility. There is an urgent need for open, multilingual, multimodal datasets reflecting clinical and geographic diversity, and clinically useful zero-shot VLMs require transparent evaluation, including explainability metrics. Future models should adopt a multidisciplinary approach, combining technical innovation with usability, data representativeness, and methodological transparency to ensure diagnostic robustness.
dc.description.sponsorship: This work has been supported by the Basque Government through the Hazitek 2024 program, Spain, within the framework of the IRUD-IA project: "Medical Image Analysis Technologies with Artificial Intelligence for the Development of Medical Devices", project code ZE-2024/00030
dc.identifier.citation: Guerrero-Tamayo, A., Oleagordia-Ruiz, I., & Garcia-Zapirain, B. (2026). Vision-language zero-shot models for radiographic image classification: a systematic review. Machine Learning with Applications, 23. https://doi.org/10.1016/J.MLWA.2025.100826
dc.identifier.doi: 10.1016/J.MLWA.2025.100826
dc.identifier.eissn: 2666-8270
dc.identifier.uri: https://hdl.handle.net/20.500.14454/5321
dc.language.iso: eng
dc.publisher: Elsevier Ltd
dc.rights: © 2025 The Authors
dc.subject.other: Image classification
dc.subject.other: Radiographic
dc.subject.other: Survey
dc.subject.other: Systematic review
dc.subject.other: Vision-language models
dc.subject.other: X-ray
dc.subject.other: Zero-shot
dc.title: Vision-language zero-shot models for radiographic image classification: a systematic review
dc.type: review article
dcterms.accessRights: open access
oaire.citation.title: Machine Learning with Applications
oaire.citation.volume: 23
oaire.licenseCondition: https://creativecommons.org/licenses/by-nc-nd/4.0/
oaire.version: VoR
File: guerrero_vision_2026.pdf (2.5 MB, Adobe Portable Document Format)