Studying the impact of data preprocessing, hyperparameter tuning and machine learning algorithms in crash prediction explainability
| dc.contributor.author | Díaz Aparicio, Jon | |
| dc.contributor.author | Rodríguez Esparza, Erick | |
| dc.contributor.author | Fajardo Calderín, Jenny | |
| dc.contributor.author | Onieva Caracuel, Enrique | |
| dc.date.accessioned | 2026-04-29T18:29:40Z | |
| dc.date.available | 2026-04-29T18:29:40Z | |
| dc.date.issued | 2026-07 | |
| dc.date.updated | 2026-04-29T18:29:40Z | |
| dc.description.abstract | Road traffic crashes remain a major global concern, causing more than 1.3 million fatalities each year and underscoring the need for improved tools to understand and predict crash occurrence. This study presents an integrated retrospective crash-risk screening framework that merges four heterogeneous data sources (crash records, road infrastructure, connected vehicle data, and travel demand) to model road-segment crash risk in Madrid. Ten preprocessing configurations are created using oversampling (generating instances of the minority class), undersampling (removing instances of the majority class), dataset expansion (generating new data), and SMOTE, each tested with and without normalization. Seven machine-learning algorithms (tree ensembles and SVMs) are evaluated under regression, multiclass classification, and binary classification formulations, resulting in a total of 210 experiments. Binary classification delivered the best performance, with gradient boosting trained on normalized, undersampled data emerging as the strongest model. Subsequent Bayesian hyperparameter optimization further enhanced its predictive capability. Explainable AI analysis using SHAP values revealed that braking events are the most influential predictors of crash likelihood, followed by road length and traffic demand, emphasizing the relevance of driver-behavior indicators in safety modeling. Overall, the findings demonstrate the benefits of integrating traditional crash data with emerging connected vehicle and demand-based information. The study provides evidence that explainable machine learning approaches can effectively support data-driven decision-making for road-safety management and targeted intervention planning. | en |
| dc.description.sponsorship | This research was funded by the Research & Innovation Programme, Spain, under Grant Agreement No 101077433 [project SOTERIA (Systematic and orchestrated deployment of safety solutions in complex urban environments for ageing and vulnerable societies)]. This work has also been partially funded by the Spanish Ministry of Science, Innovation and Universities through the RENAISSANCE project [PID2022-140612OB-I00]. | en |
| dc.identifier.citation | Díaz-Aparicio, J., Rodríguez-Esparza, E., Fajardo-Calderín, J., & Onieva, E. (2026). Studying the impact of data preprocessing, hyperparameter tuning and machine learning algorithms in crash prediction explainability. Array, 30. https://doi.org/10.1016/J.ARRAY.2026.100743 | |
| dc.identifier.doi | 10.1016/J.ARRAY.2026.100743 | |
| dc.identifier.eissn | 2590-0056 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14454/5825 | |
| dc.language.iso | eng | |
| dc.publisher | Elsevier B.V. | |
| dc.rights | © 2026 The Author(s) | |
| dc.subject.other | Crash prediction | |
| dc.subject.other | Explainable AI | |
| dc.subject.other | Imbalanced learning | |
| dc.subject.other | Machine learning | |
| dc.subject.other | Road safety | |
| dc.title | Studying the impact of data preprocessing, hyperparameter tuning and machine learning algorithms in crash prediction explainability | en |
| dc.type | journal article | |
| dcterms.accessRights | open access | |
| oaire.citation.title | Array | |
| oaire.citation.volume | 30 | |
| oaire.licenseCondition | https://creativecommons.org/licenses/by-nc/4.0/ | |
| oaire.version | VoR |
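
The experiment count stated in the abstract (ten preprocessing configurations, seven algorithms, three formulations, 210 runs) can be checked with a minimal sketch. This is an illustration under stated assumptions, not the authors' code: the `"none"` (no resampling) strategy is assumed here so that the four named strategies, each with and without normalization, reach ten configurations, and the algorithm names are hypothetical placeholders because the abstract does not list them.

```python
from itertools import product

# Resampling strategies named in the abstract. "none" (no resampling) is an
# ASSUMPTION added so the total reaches the ten configurations reported.
RESAMPLING = ["none", "oversampling", "undersampling", "expansion", "smote"]
NORMALIZATION = [False, True]

# Hypothetical placeholder names: the abstract only states that seven
# algorithms (tree ensembles and SVMs) were evaluated.
ALGORITHMS = [f"algorithm_{i}" for i in range(1, 8)]

# The three problem formulations named in the abstract.
FORMULATIONS = ["regression", "multiclass", "binary"]


def build_grid():
    """Enumerate every (preprocessing, algorithm, formulation) combination."""
    preprocessing = list(product(RESAMPLING, NORMALIZATION))
    return [
        {
            "resampling": resampling,
            "normalized": normalized,
            "algorithm": algorithm,
            "formulation": formulation,
        }
        for (resampling, normalized), algorithm, formulation in product(
            preprocessing, ALGORITHMS, FORMULATIONS
        )
    ]


grid = build_grid()
print(len(grid))  # 10 preprocessing configs * 7 algorithms * 3 formulations
```

Each dictionary in `grid` describes one experiment; a real pipeline would map these keys to concrete resamplers, scalers, and estimators.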
Files
Original bundle
- Name: diaz_studying_2026.pdf
- Size: 1.93 MB
- Format: Adobe Portable Document Format