Enhancing design of experiments through uncertainty estimation and synthetic data generation

Moles, Luis; Andrés Fernández, Alain; Echegaray López, Goretti; Boto Sánchez, Fernando

dc.contributor.author	Moles, Luis
dc.contributor.author	Andrés Fernández, Alain
dc.contributor.author	Echegaray López, Goretti
dc.contributor.author	Boto Sánchez, Fernando
dc.date.accessioned	2026-03-13T11:49:23Z
dc.date.available	2026-03-13T11:49:23Z
dc.date.issued	2026-03
dc.date.updated	2026-03-13T11:49:23Z
dc.description.abstract	Design of Experiments is a key methodology for optimizing machine learning models, but traditional methods often depend on extensive real data collection, which is costly and time-consuming. Moreover, predefined experimental designs may struggle to adapt to complex or high-dimensional input spaces, sometimes leading to inefficient exploration, especially when data are scarce and uncertainty is high. To address these challenges, we propose a methodology that integrates uncertainty estimation with synthetic data generation. First, we evaluate several uncertainty estimators (Gaussian Process, Monte Carlo Dropout, and tree-based ensembles) that identify the input regions where the current model is most uncertain. Next, we analyze different generative models (Variational Autoencoders, Generative Adversarial Networks, and Large Language Models) trained under varying levels of data availability (from only 10% of the real dataset up to the full data) to test their robustness under extreme scarcity. Finally, we combine the best uncertainty estimator with the most reliable generative model in a hybrid active learning pipeline. Beyond the standard setting, we systematically vary the number and proportion of synthetic versus real samples, showing how the mixture affects predictive accuracy and uncertainty reduction. The experiments show that Gaussian Process uncertainty estimation outperforms the other tested methods under extreme data scarcity, and that Variational Autoencoders produce the most stable synthetic samples with as little as 10% of the real data used for training. The full hybrid loop (Gaussian Process + Variational Autoencoder) achieves an R² similar to the baselines while driving down uncertainty significantly faster, offering a data-efficient strategy for costly experimental contexts.	en
dc.description.sponsorship	The authors gratefully acknowledge the financial support of the Basque Government (Eusko Jaurlaritza) under the “Programa de apoyo a la investigación colaborativa en áreas estratégicas” program (Project BISUM II: Ref. KK-2024/00048).	en
dc.identifier.citation	Moles, L., Andres, A., Echegaray, G., & Boto, F. (2026). Enhancing design of experiments through uncertainty estimation and synthetic data generation. Results in Engineering, 29. https://doi.org/10.1016/J.RINENG.2026.109409
dc.identifier.doi	10.1016/J.RINENG.2026.109409
dc.identifier.eissn	2590-1230
dc.identifier.uri	https://hdl.handle.net/20.500.14454/5440
dc.language.iso	eng
dc.publisher	Elsevier B.V.
dc.subject.other	Data augmentation
dc.subject.other	Design of experiments
dc.subject.other	Gaussian process
dc.subject.other	Synthetic data
dc.subject.other	Uncertainty estimation
dc.title	Enhancing design of experiments through uncertainty estimation and synthetic data generation	en
dc.type	journal article
dcterms.accessRights	open access
oaire.citation.title	Results in Engineering
oaire.citation.volume	29
oaire.licenseCondition	https://creativecommons.org/licenses/by-nc-nd/4.0/
oaire.version	VoR
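
The abstract describes using a Gaussian Process to find the input regions where the current model is most uncertain. The sketch below illustrates that general idea only and is not the authors' implementation: the RBF kernel, the length scale, the noise level, the 1-D inputs, and every function name (`rbf`, `solve`, `posterior_variance`, `most_uncertain`) are assumptions made for illustration. It computes the standard zero-mean GP posterior variance at a set of candidate inputs and picks the candidate with the largest variance.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential (RBF) kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def posterior_variance(x_train, x_new, noise=1e-6, length_scale=1.0):
    """GP posterior variance at x_new: k(x*,x*) - k*^T (K + noise*I)^-1 k*."""
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    k_star = [rbf(a, x_new, length_scale) for a in x_train]
    v = solve(K, k_star)
    return rbf(x_new, x_new, length_scale) - sum(k * w for k, w in zip(k_star, v))

def most_uncertain(x_train, candidates, **kwargs):
    """Return the candidate input with the largest posterior variance."""
    return max(candidates, key=lambda c: posterior_variance(x_train, c, **kwargs))

# Hypothetical 1-D example: training inputs cluster near 0..2, plus one at 6.
x_train = [0.0, 1.0, 2.0, 6.0]
candidates = [0.5, 1.5, 4.0, 5.5]
chosen = most_uncertain(x_train, candidates)
```

With fixed kernel hyperparameters, the posterior variance depends only on where the training inputs lie, not on their measured outputs, so the selected point here is the candidate farthest from existing data. In a loop like the one the abstract outlines, such a point would be the next location to sample or to augment with synthetic data.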

Files

Original bundle

Name: moles_enhancing_2026.pdf
Size: 7.32 MB
Format: Adobe Portable Document Format

Collections

Articles