Enhancing design of experiments through uncertainty estimation and synthetic data generation

dc.contributor.author: Moles, Luis
dc.contributor.author: Andrés Fernández, Alain
dc.contributor.author: Echegaray López, Goretti
dc.contributor.author: Boto Sánchez, Fernando
dc.date.accessioned: 2026-03-13T11:49:23Z
dc.date.available: 2026-03-13T11:49:23Z
dc.date.issued: 2026-03
dc.date.updated: 2026-03-13T11:49:23Z
dc.description.abstract: Design of Experiments is a key methodology for optimizing machine learning models, but traditional methods often depend on extensive real data collection, which is costly and time-consuming. Moreover, predefined experimental designs may struggle to adapt to complex or high-dimensional input spaces, sometimes leading to inefficient exploration, especially when data are scarce and uncertainty is high. To address these challenges, we propose a methodology that integrates uncertainty estimation with synthetic data generation. First, we evaluate several uncertainty estimators (Gaussian Process, Monte Carlo Dropout, and tree-based ensembles) that identify the input regions where the current model is most uncertain. Next, we analyze different generative models (Variational Autoencoders, Generative Adversarial Networks, and Large Language Models) trained under varying levels of data availability (from only 10% of the real dataset up to the full data) to test their robustness under extreme scarcity. Finally, we combine the best uncertainty estimator with the most reliable generative model in a hybrid active learning pipeline. Beyond the standard setting, we systematically vary the number and proportion of synthetic versus real samples, showing how the mixture affects predictive accuracy and uncertainty reduction. The experiments show that Gaussian Process uncertainty estimation outperforms the other tested methods under extreme data scarcity, and that Variational Autoencoders produce the most stable synthetic samples with as little as 10% of the real data used for training. The full hybrid loop (Gaussian Process + Variational Autoencoder) achieves an R² similar to the baselines while driving down uncertainty significantly faster, offering a data-efficient strategy for costly experimental contexts.
dc.description.sponsorship: The authors gratefully acknowledge the financial support given by the Basque Government (Eusko Jaurlaritza) under the “Programa de apoyo a la investigación colaborativa en áreas estratégicas” program (Project BISUM II, Ref. KK-2024/00048).
dc.identifier.citation: Moles, L., Andres, A., Echegaray, G., & Boto, F. (2026). Enhancing design of experiments through uncertainty estimation and synthetic data generation. Results in Engineering, 29. https://doi.org/10.1016/J.RINENG.2026.109409
dc.identifier.doi: 10.1016/J.RINENG.2026.109409
dc.identifier.eissn: 2590-1230
dc.identifier.uri: https://hdl.handle.net/20.500.14454/5440
dc.language.iso: eng
dc.publisher: Elsevier B.V.
dc.subject.other: Data augmentation
dc.subject.other: Design of experiments
dc.subject.other: Gaussian process
dc.subject.other: Synthetic data
dc.subject.other: Uncertainty estimation
dc.title: Enhancing design of experiments through uncertainty estimation and synthetic data generation
dc.type: journal article
dcterms.accessRights: open access
oaire.citation.title: Results in Engineering
oaire.citation.volume: 29
oaire.licenseCondition: https://creativecommons.org/licenses/by-nc-nd/4.0/
oaire.version: VoR
Files
Name: moles_enhancing_2026.pdf
Size: 7.32 MB
Format: Adobe Portable Document Format