Bayesian generation of synthetic datasets for machine-learning tasks: a performance study

Fosci, Paolo; Nieves Acedo, Javier; Psaila, Giuseppe; Boffelli, Jacopo; García Bringas, Pablo

dc.contributor.author	Fosci, Paolo
dc.contributor.author	Nieves Acedo, Javier
dc.contributor.author	Psaila, Giuseppe
dc.contributor.author	Boffelli, Jacopo
dc.contributor.author	García Bringas, Pablo
dc.date.accessioned	2026-02-20T16:36:13Z
dc.date.available	2026-02-20T16:36:13Z
dc.date.issued	2026-03-14
dc.date.updated	2026-02-20T16:36:13Z
dc.description.abstract	Performing Machine Learning (ML) tasks on large-scale datasets, as well as simply storing them for subsequent analysis or long-term archival, requires substantial computational resources. The described approach builds on the technique known as “Bayesian Generation” to produce synthetic datasets in which the probability distribution of the source dataset is preserved as far as possible, even when the synthetic datasets are much smaller than the original (large) dataset. This study investigates the impact of training ML models on smaller synthetic datasets in place of the original dataset, from a twofold perspective. Firstly, the impact on the effectiveness of ML models trained on these smaller synthetic datasets is assessed. Secondly, the computational resources required to generate the synthetic datasets, train ML models on them, and perform the testing phase are measured; specifically, both execution time and main memory usage are taken into account. Finally, this work shows that the loss in effectiveness remains consistently limited and stable, and it identifies the scenarios and ML techniques for which incorporating the generation of small synthetic datasets into the ML pipeline can be beneficial for practical deployment in environments with constrained computational resources, such as mobile or industrial devices.	en
dc.description.sponsorship	This study was funded by the European Union - NextGenerationEU, within the framework of the GRINS - Growing Resilient, INclusive and Sustainable project (GRINS PE00000018 – CUP F83C22001720001)	en
dc.identifier.citation	Fosci, P., Nieves, J., Psaila, G., Boffelli, J., & Garcia Bringas, P. (2026). Bayesian generation of synthetic datasets for machine-learning tasks: a performance study. Neurocomputing, 670. https://doi.org/10.1016/J.NEUCOM.2025.132508
dc.identifier.doi	10.1016/J.NEUCOM.2025.132508
dc.identifier.eissn	1872-8286
dc.identifier.issn	0925-2312
dc.identifier.uri	https://hdl.handle.net/20.500.14454/5189
dc.language.iso	eng
dc.publisher	Elsevier B.V.
dc.rights	© 2025 The Author(s)
dc.subject.other	Bayesian generation
dc.subject.other	Bayesian networks
dc.subject.other	Effectiveness and efficiency
dc.subject.other	Generation of synthetic data
dc.subject.other	The YABaGen tool
dc.title	Bayesian generation of synthetic datasets for machine-learning tasks: a performance study	en
dc.type	journal article
dcterms.accessRights	open access
oaire.citation.title	Neurocomputing
oaire.citation.volume	670
oaire.licenseCondition	https://creativecommons.org/licenses/by/4.0/
oaire.version	VoR

Files

Original bundle

Name: fosci_bayesian_2026.pdf
Size: 2.97 MB
Format: Adobe Portable Document Format

Collections

Articles