Studying the impact of data preprocessing, hyperparameter tuning and machine learning algorithms in crash prediction explainability

Díaz Aparicio, Jon; Rodríguez Esparza, Erick; Fajardo Calderín, Jenny; Onieva Caracuel, Enrique

dc.contributor.author	Díaz Aparicio, Jon
dc.contributor.author	Rodríguez Esparza, Erick
dc.contributor.author	Fajardo Calderín, Jenny
dc.contributor.author	Onieva Caracuel, Enrique
dc.date.accessioned	2026-04-29T18:29:40Z
dc.date.available	2026-04-29T18:29:40Z
dc.date.issued	2026-07
dc.date.updated	2026-04-29T18:29:40Z
dc.description.abstract	Road traffic crashes remain a major global concern, causing more than 1.3 million fatalities each year and underscoring the need for improved tools to understand and predict crash occurrence. This study presents an integrated retrospective crash-risk screening framework that merges four heterogeneous data sources (crash records, road infrastructure, connected vehicle data, and travel demand) to model road-segment crash risk in Madrid. Ten preprocessing configurations are created using oversampling (generating instances of the minority class), undersampling (removing instances of the majority class), dataset expansion (new data generation), and SMOTE, each tested with and without normalization. Seven machine-learning algorithms (tree ensembles and SVMs) are evaluated under regression, multiclass classification, and binary classification formulations, resulting in a total of 210 experiments. Binary classification delivered the best performance, with gradient boosting trained on normalized, undersampled data emerging as the strongest model. Subsequent Bayesian hyperparameter optimization further enhanced its predictive capability. Explainable AI analysis using SHAP values revealed that braking events are the most influential predictors of crash likelihood, followed by road length and traffic demand, emphasizing the relevance of driver-behavior indicators in safety modeling. Overall, the findings demonstrate the benefits of integrating traditional crash data with emerging connected vehicle and demand-based information. The study provides evidence that explainable machine learning approaches can effectively support data-driven decision-making for road-safety management and targeted intervention planning.	en
dc.description.sponsorship	This research & Innovation Programme, Spain under Grant Agreement No 101077433 [project SOTERIA (Systematic and orchestrated deployment of safety solutions in complex urban environments for ageing and vulnerable societies)]. This work has also been partially funded by the Spanish Ministry of Science, Innovation and Universities through the RENAISSANCE project [PID2022-140612OB-I00].	en
dc.identifier.citation	Díaz-Aparicio, J., Rodríguez-Esparza, E., Fajardo-Calderín, J., & Onieva, E. (2026). Studying the impact of data preprocessing, hyperparameter tuning and machine learning algorithms in crash prediction explainability. Array, 30. https://doi.org/10.1016/J.ARRAY.2026.100743
dc.identifier.doi	10.1016/J.ARRAY.2026.100743
dc.identifier.eissn	2590-0056
dc.identifier.uri	https://hdl.handle.net/20.500.14454/5825
dc.language.iso	eng
dc.publisher	Elsevier B.V.
dc.rights	© 2026 The Author(s)
dc.subject.other	Crash prediction
dc.subject.other	Explainable AI
dc.subject.other	Imbalanced learning
dc.subject.other	Machine learning
dc.subject.other	Road safety
dc.title	Studying the impact of data preprocessing, hyperparameter tuning and machine learning algorithms in crash prediction explainability	en
dc.type	journal article
dcterms.accessRights	open access
oaire.citation.title	Array
oaire.citation.volume	30
oaire.licenseCondition	https://creativecommons.org/licenses/by-nc/4.0/
oaire.version	VoR

Files

Original bundle

Name: diaz_studying_2026.pdf
Size: 1.93 MB
Format: Adobe Portable Document Format

Collections

Articles