Studying the impact of data preprocessing, hyperparameter tuning and machine learning algorithms in crash prediction explainability

dc.contributor.author: Díaz Aparicio, Jon
dc.contributor.author: Rodríguez Esparza, Erick
dc.contributor.author: Fajardo Calderín, Jenny
dc.contributor.author: Onieva Caracuel, Enrique
dc.date.accessioned: 2026-04-29T18:29:40Z
dc.date.available: 2026-04-29T18:29:40Z
dc.date.issued: 2026-07
dc.date.updated: 2026-04-29T18:29:40Z
dc.description.abstract (en): Road traffic crashes remain a major global concern, causing more than 1.3 million fatalities each year and underscoring the need for improved tools to understand and predict crash occurrence. This study presents an integrated retrospective crash-risk screening framework that merges four heterogeneous data sources (crash records, road infrastructure, connected vehicle data, and travel demand) to model road-segment crash risk in Madrid. Ten preprocessing configurations are created using oversampling (generating instances of the minority class), undersampling (removing instances of the majority class), dataset expansion (new data generation), and SMOTE, each tested with and without normalization. Seven machine-learning algorithms (tree ensembles and SVMs) are evaluated under regression, multiclass classification, and binary classification formulations, resulting in a total of 210 experiments (10 configurations × 7 algorithms × 3 formulations). Binary classification delivered the best performance, with gradient boosting trained on normalized, undersampled data emerging as the strongest model. Subsequent Bayesian hyperparameter optimization further enhanced its predictive capability. Explainable AI analysis using SHAP values revealed that braking events are the most influential predictors of crash likelihood, followed by road length and traffic demand, emphasizing the relevance of driver-behavior indicators in safety modeling. Overall, the findings demonstrate the benefits of integrating traditional crash data with emerging connected vehicle and demand-based information. The study provides evidence that explainable machine learning approaches can effectively support data-driven decision-making for road-safety management and targeted intervention planning.
dc.description.sponsorship (en): This research & Innovation Programme, Spain under Grant Agreement No 101077433 [project SOTERIA (Systematic and orchestrated deployment of safety solutions in complex urban environments for ageing and vulnerable societies)]. This work has also been partially funded by the Ministry of Science, Innovation and Universities of Spain through the RENAISSANCE project [PID2022-140612OB-I00].
dc.identifier.citation: Díaz-Aparicio, J., Rodríguez-Esparza, E., Fajardo-Calderín, J., & Onieva, E. (2026). Studying the impact of data preprocessing, hyperparameter tuning and machine learning algorithms in crash prediction explainability. Array, 30. https://doi.org/10.1016/J.ARRAY.2026.100743
dc.identifier.doi: 10.1016/J.ARRAY.2026.100743
dc.identifier.eissn: 2590-0056
dc.identifier.uri: https://hdl.handle.net/20.500.14454/5825
dc.language.iso: eng
dc.publisher: Elsevier B.V.
dc.rights: © 2026 The Author(s)
dc.subject.other: Crash prediction
dc.subject.other: Explainable AI
dc.subject.other: Imbalanced learning
dc.subject.other: Machine learning
dc.subject.other: Road safety
dc.title (en): Studying the impact of data preprocessing, hyperparameter tuning and machine learning algorithms in crash prediction explainability
dc.type: journal article
dcterms.accessRights: open access
oaire.citation.title: Array
oaire.citation.volume: 30
oaire.licenseCondition: https://creativecommons.org/licenses/by-nc/4.0/
oaire.version: VoR
Files
Original bundle
Name: diaz_studying_2026.pdf
Size: 1.93 MB
Format: Adobe Portable Document Format