Training set augmentation and biology-aware harmonization improve radiomic models for lung cancer prediction in indeterminate nodules

C. Huchthausen, M. Shi, G.L.A. de Sousa, J. Larner, E. Janowski, J. Colen, and K. Wijesooriya

Published in Scientific Reports, 2026

CT radiomics-based machine learning has the potential to predict lung cancer in pulmonary nodules (PNs) earlier than standard-of-care methods. Low malignancy rates in early-development PNs and variable image acquisition hinder the development of radiomic models for diagnosing these PNs. To address these challenges, we augmented training using later-development PNs and harmonized for acquisition effects. We examined early-development benign and malignant PNs (n = 106) below the sensitivity of standard-of-care diagnosis. Classifiers predicting malignancy performed near chance when trained on ComBat-harmonized radiomic features from only early-development PNs. We then augmented training with later-development benign and malignant PNs (n = 225). We evaluated whether harmonization must incorporate biology that impacts acquisition effects in added training data. To correct variability from four acquisition protocols, we compared: (1) biology-unaware harmonization, (2) harmonizing with a covariate distinguishing the early-development, later-development benign, and later-development malignant datasets, and (3) harmonizing each dataset separately. Models trained using augmentation but biology-unaware harmonization failed to improve consistently. Augmented training data harmonized with a covariate (ROC-AUC 0.74 [0.69–0.79]) or separately (ROC-AUC 0.71 [0.66–0.77]) yielded higher test ROC-AUC (DeLong, p ≤ 0.05) and PR-AUC (Wilcoxon, p ≤ 0.05). In a proof-of-principle methodological study, we demonstrate with a small single-center dataset that combining radiomic features from later-development benign and malignant PNs requires biology-aware harmonization.
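The three harmonization strategies compared in the abstract can be illustrated with a minimal sketch. This is not the authors' code and it is not ComBat itself: it uses a simplified location-scale correction without ComBat's empirical Bayes shrinkage, run on synthetic data. The function names (`harmonize`, `harmonize_separately`), the dataset labels, and the simulated effect sizes are all assumptions made for illustration.

```python
import numpy as np


def harmonize(features, batch, covariate=None):
    """Simplified location-scale batch correction (illustrative stand-in for ComBat).

    Removes per-batch mean and variance differences from each feature. If a
    categorical covariate is given, its effect is estimated jointly with the
    batch effect and added back, so it is preserved rather than removed.
    """
    features = np.asarray(features, dtype=float)
    batch = np.asarray(batch)
    n_samples = features.shape[0]

    # One-hot design for acquisition protocol (batch).
    batch_levels = np.unique(batch)
    batch_design = (batch[:, None] == batch_levels[None, :]).astype(float)

    # Dummy coding for the covariate, dropping the first level as reference.
    if covariate is None:
        cov_design = np.zeros((n_samples, 0))
    else:
        covariate = np.asarray(covariate)
        cov_levels = np.unique(covariate)[1:]
        cov_design = (covariate[:, None] == cov_levels[None, :]).astype(float)

    # Ordinary least squares fit of every feature on batch + covariate.
    design = np.hstack([batch_design, cov_design])
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    n_batch = len(batch_levels)
    batch_coef, cov_coef = coef[:n_batch], coef[n_batch:]

    grand_mean = batch_design.mean(axis=0) @ batch_coef  # sample-weighted mean
    biology = cov_design @ cov_coef
    resid = features - batch_design @ batch_coef - biology

    # Rescale each batch's residuals to the pooled residual spread.
    pooled_sd = resid.std(axis=0)
    out = np.empty_like(features)
    for level in batch_levels:
        mask = batch == level
        sd = resid[mask].std(axis=0)
        sd = np.where(sd > 0, sd, 1.0)  # guard against zero variance
        out[mask] = resid[mask] / sd * pooled_sd
    return out + grand_mean + biology


def harmonize_separately(features, batch, group):
    """Strategy (3): run the batch correction independently within each dataset."""
    features = np.asarray(features, dtype=float)
    batch, group = np.asarray(batch), np.asarray(group)
    out = np.empty_like(features)
    for level in np.unique(group):
        mask = group == level
        out[mask] = harmonize(features[mask], batch[mask])
    return out


# Synthetic data (assumed): 4 protocols, 3 datasets, 5 radiomic features.
rng = np.random.default_rng(0)
n = 240
protocol = rng.integers(0, 4, size=n)
dataset = rng.choice(["early", "later_benign", "later_malignant"], size=n)
X = rng.normal(size=(n, 5))
X += 0.8 * protocol[:, None]                        # simulated acquisition effect
X += 1.5 * (dataset == "later_malignant")[:, None]  # simulated biological effect

unaware = harmonize(X, protocol)                         # (1) biology-unaware
with_covariate = harmonize(X, protocol, covariate=dataset)  # (2) dataset covariate
separate = harmonize_separately(X, protocol, dataset)    # (3) per-dataset
```

In this sketch, option (1) attributes every between-protocol difference to acquisition, so an uneven mix of datasets across protocols can be absorbed into the correction. Options (2) and (3) estimate the protocol effect while accounting for dataset membership, which is the distinction the abstract describes as biology-aware.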
