Scientific article
English

Impact of harmonization and oversampling methods on radiomics analysis of multi-center imbalanced datasets : application to PET-based prediction of lung cancer subtypes

Published in: EJNMMI physics, vol. 12, no. 1, 34
First online date: 2025-04-07
Abstract

Background: Medical imaging data are frequently affected by heterogeneity in image generation and by class imbalance, both of which make it difficult for data-driven machine-learning methods to achieve strong, generalizable predictive performance. The purpose of this study was to investigate the impact of harmonization and oversampling methods on multi-center imbalanced datasets, with specific application to PET-based radiomics modeling for histologic subtype prediction in non-small cell lung cancer (NSCLC).

Methods: This retrospective study included 245 patients with adenocarcinoma (ADC) and 78 patients with squamous cell carcinoma (SCC) from 4 centers. Using 1502 radiomics features per patient, we trained, validated, and tested 4 machine-learning classifiers to investigate the effect on subtype prediction of no harmonization (NoH) or 4 feature harmonization methods, paired with no oversampling (NoO) or 5 oversampling methods. Model performance was evaluated using the average area under the ROC curve (AUROC) and G-mean over 5 repetitions of 5-fold cross-validation. Statistical comparisons of the combined models against baseline (NoH + NoO) were performed for each fold of cross-validation using the DeLong test.
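
As an illustration of the two evaluation metrics named above, here is a minimal Python sketch. It is not the authors' code. It assumes the usual definitions (AUROC computed from the rank-sum identity, G-mean as the geometric mean of sensitivity and specificity) and, arbitrarily, codes SCC as the positive class 1. The function names auroc and g_mean are hypothetical.

```python
import math


def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity; ties share average ranks."""
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied scores starting at i.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Class counts from the abstract: 245 ADC (coded 0 here) and 78 SCC (coded 1 here).
labels = [0] * 245 + [1] * 78
always_adc = [0] * 323  # a degenerate model that always predicts the majority class
accuracy = sum(t == p for t, p in zip(labels, always_adc)) / len(labels)
```

With these class counts, the degenerate majority-class predictor reaches an accuracy of about 0.76 while its G-mean is 0, which is why G-mean is a useful complement to AUROC on imbalanced data.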

Results: The number of cross-combinations for which both AUROC and G-mean outperformed baseline in validation and testing was 15, 4, 2, and 7 (out of 29 non-baseline combinations) for random forest (RF), linear discriminant analysis (LDA), logistic regression (LR), and support vector machine (SVM), respectively. ComBat harmonization combined with oversampling (SMOTE) via RF yielded better performance than baseline (AUROC and G-mean of validation: 0.725 vs. 0.608 and 0.625 vs. 0.398; testing: 0.637 vs. 0.567 and 0.506 vs. 0.287), though the differences were not statistically significant.
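
To make the SMOTE step mentioned above concrete, the following is a minimal Python sketch of SMOTE-style oversampling: each synthetic sample is a random interpolation between a minority-class sample and one of its nearest minority-class neighbours. It is an illustration only, not the implementation used in the study. The function name smote, the neighbour count k=5, and the toy feature vectors are assumptions introduced here.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority-class neighbours.

    minority: list of feature vectors (lists of floats) from the minority class.
    k: number of nearest neighbours considered (assumed value, not from the paper).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to base.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy minority class with two features (hypothetical values).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_synthetic = smote(toy_minority, n_new=6, k=2)
```

Balancing the full cohort would require 245 - 78 = 167 synthetic SCC samples. Within cross-validation, oversampling should be applied to the training folds only, so that synthetic samples derived from validation or test patients do not leak into model fitting.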

Conclusions: Applying harmonization and oversampling methods to multi-center imbalanced datasets can improve NSCLC-subtype prediction, but the effect varies widely across classifiers. We have created open-source comparisons of harmonization and oversampling methods across classifiers to support comprehensive evaluation in other studies.

Keywords
  • Harmonization
  • Multi-center imbalanced datasets
  • NSCLC
  • Oversampling
  • PET radiomics
Funding
  • National Natural Science Foundation of China - [62371221]
  • Science and Technology Program of Guangdong Province - [2022A0505050039]
  • Natural Sciences and Engineering Research Council of Canada - [RGPIN-2019-06467]
  • Natural Science Foundation of Inner Mongolia Autonomous Region - [2024QN08063]
Citation (ISO format)
DU, Dongyang et al. Impact of harmonization and oversampling methods on radiomics analysis of multi-center imbalanced datasets : application to PET-based prediction of lung cancer subtypes. In: EJNMMI physics, 2025, vol. 12, n° 1, p. 34. doi: 10.1186/s40658-025-00750-7
Main files (1)
Article (Published version)
Identifiers
Journal ISSN: 2197-7364

Technical information

Creation: 04/08/2025 7:59:57 AM
First validation: 04/16/2025 8:51:18 AM
Update time: 04/16/2025 8:51:18 AM
Status update: 04/16/2025 8:51:18 AM
Last indexation: 04/16/2025 8:51:19 AM