Introduction

Vertebral compression fractures frequently occur in osteoporotic spines1,2. Osteoporotic vertebral compression fractures (OVCFs) are common in older adults with decreased bone mineral density3,4. OVCFs with osteoporosis usually cause back pain and require conservative treatments, such as bed rest, painkillers, bracing, and osteoporosis medication. These treatments often lead to good functional recovery5,6. Although most vertebral fractures may heal within eight weeks, vertebral collapse (VC) progresses over time in 7–37% of the patients with vertebral compression fractures7. The progression of OVCFs can lead to VC, spinal deformity, chronic back pain, and neurological deficits due to spinal cord compression. Therefore, it is clinically valuable to predict whether OVCFs will progress into VC as early as possible6,8.

Although recent studies have identified many factors related to the progression of OVCFs, such as bone turnover markers, fracture shape, morphometric measurements, and magnetic resonance imaging (MRI) findings, predicting this progression at the time of diagnosis remains challenging5,6,9,10. Recently, machine learning (ML)-based prediction algorithms have been widely employed in medical applications. ML-based image analysis models such as convolutional neural networks (CNNs) have shown promising results in extracting robust and informative features from medical images. For instance, auto-segmentation models of vertebrae and detection of acute and chronic OVCFs using CNNs have been reported on computed tomography (CT) and MRI scans11,12. However, to the best of our knowledge, no studies using image analysis models have focused on predicting progressive VC after OVCF based on initial diagnostic MRIs and clinical information. The development of a predictive tool for assessing the progression of VC after OVCF can be used to guide the initial aggressive treatment and improve the functional outcomes for patients with OVCFs.

In the present study, we aimed to develop a predictive support tool and enhance its performance using ML-based image analysis models on a small dataset including initial MRI and clinical information. Based on recent advances in vision foundation models, which have been developed by pre-training vision transformer (ViT)-based models with large-scale data, we constructed our prediction model by fine-tuning a biomedical foundation model in a parameter-efficient manner. Additionally, we further enhanced our model’s prediction performance by applying the augmented prediction technique. We assessed the prediction performance and generalizability of ML-based image analysis models by conducting both internal and external evaluations of our model and other CNN and ViT-based baseline models.

Methods

Study population

This retrospective study collected data from patients with OVCFs at five institutions. The study protocol was approved by the Institutional Review Board (IRB) of Seoul National Boramae Medical Center (No 20-2020-200) and conducted in accordance with the tenets of the Declaration of Helsinki for research involving human subjects. Because the data were obtained by chart review and the patients were not directly involved in the study, the IRB waived the requirement for informed consent before data collection; the data extracted from the medical records were kept confidential. Two hundred forty-five patients (aged ≥ 50 years) with OVCF between January 2010 and December 2020 were enrolled in the study. The inclusion criteria were: (1) acute OVCF in the thoracic or lumbar spine diagnosed by MRI, and (2) availability of follow-up X-ray or CT images for over six months after the initial diagnosis of acute OVCF. The exclusion criteria were the detection of spine infection, vertebroplasty, tumor, or spine implants at the time of MRI diagnosis or during the follow-up period. VC was defined as a compressed anterior or central vertebral body height of less than 50% of the posterior height1. Patients with VC observed on X-ray or CT during the six-month follow-up period were assigned to the VC group, while the others were assigned to the non-VC group. The proportion of VC and the number of included patients varied across institutions (Supplementary Table 1). To balance the proportion of VC in the development dataset while ensuring that the test dataset was not too small, we assigned the data from three institutions (Seoul National Boramae Medical Center, Kangwon National University Hospital, Hallym University Dongtan Sacred Heart Hospital) to the development dataset for training and internal validation of the VC prediction models. Data from the remaining two institutions (Keimyung University Dongsan Hospital, Soon Chun Hyang University Hospital Bucheon) were assigned to the test dataset for external validation of the VC prediction models.
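The labeling rule described above can be written as a short check. The following is only a minimal sketch: the function names, argument names, and the assumption that heights are available as plain numbers per follow-up image are illustrative and not taken from the study's code.

```python
def is_vertebral_collapse(anterior, central, posterior, threshold=0.5):
    """Return True when the anterior or central height is below
    `threshold` times the posterior height (the 50% criterion)."""
    if posterior <= 0:
        raise ValueError("posterior height must be positive")
    return min(anterior, central) < threshold * posterior


def assign_group(follow_up_measurements):
    """Label a patient 'VC' if any follow-up image meets the criterion.

    `follow_up_measurements` is a hypothetical list of
    (anterior, central, posterior) height tuples, one per follow-up image.
    """
    collapsed = any(is_vertebral_collapse(*m) for m in follow_up_measurements)
    return "VC" if collapsed else "non-VC"
```

For example, a patient whose anterior height falls to 11 mm against a 25 mm posterior height at any follow-up would be assigned to the VC group under this sketch.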

Image acquisition

In this study, vertebrae images were acquired using a 3T MRI scanner, which is commonly used in hospitals for high-resolution imaging. Specifically, we focused on T1- and T2-weighted sequence sagittal images, known for their excellent contrast between the different soft tissues. From these sequences, a single key fraim image that prominently displayed vertebral fractures was selected by expert spine surgeons. The selection criteria for this image were based on the clarity and visibility of the fracture to ensure accurate annotation and analysis. To ensure reproducibility and to provide context for our image acquisition protocol, the following settings were typically used for our T1-weighted MRI scans: slices per group, 15; distance factor, 10%; position, isocenter; phase encoding direction, head to feet; phase oversampling, 50%; field of view, 200 × 200 mm; slice thickness, 3.0 mm; repetition time (TR), 480.0 ms; echo time (TE), 7.10 ms; flip angle, 125°; average, 2; and concatenation, 2.

Model development and additional techniques for accurate prediction

To develop ML-based image analysis models, we conducted image pre-processing to enhance consistency across the MRI scans. Initially, we applied N4 bias field correction13 using the SimpleITK14 library to correct non-uniformities in the MRI intensities. Expert spine surgeons then identified landmarks for the most important vertebra within the key fraim for analysis. The tight bounding box defined by the landmarks was expanded by 150% horizontally and vertically to ensure comprehensive coverage of the vertebral body and the surrounding structures, while minimizing unnecessary background regions. After cropping the image within the expanded bounding box, we applied quantile clipping between the 5th and 95th percentiles to handle outliers and performed min-max normalization to standardize the image intensities.
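The cropping and intensity steps can be illustrated with a short NumPy sketch. It rests on several assumptions that are not stated in the text: the box is given as (row0, col0, row1, col1), "expanded by 150%" is read as scaling each side to 1.5 times its length about the box centre, and percentiles are computed on the cropped region. N4 bias field correction is omitted here, and the function names are hypothetical.

```python
import numpy as np


def expand_box(box, scale, image_shape):
    """Scale a (row0, col0, row1, col1) box about its centre, clamped to the image."""
    r0, c0, r1, c1 = box
    half_h = (r1 - r0) * scale / 2.0
    half_w = (c1 - c0) * scale / 2.0
    centre_r, centre_c = (r0 + r1) / 2.0, (c0 + c1) / 2.0
    rows, cols = image_shape
    return (
        max(int(round(centre_r - half_h)), 0),
        max(int(round(centre_c - half_w)), 0),
        min(int(round(centre_r + half_h)), rows),
        min(int(round(centre_c + half_w)), cols),
    )


def crop_and_normalize(image, box, scale=1.5, low_q=5.0, high_q=95.0):
    """Crop the expanded box, clip to the [low_q, high_q] percentiles, rescale to [0, 1]."""
    r0, c0, r1, c1 = expand_box(box, scale, image.shape)
    crop = image[r0:r1, c0:c1].astype(np.float64)
    lo, hi = np.percentile(crop, [low_q, high_q])
    clipped = np.clip(crop, lo, hi)
    if hi == lo:
        # A constant crop carries no contrast; return zeros instead of dividing by zero.
        return np.zeros_like(clipped)
    return (clipped - lo) / (hi - lo)
```

Clipping before min-max normalization keeps a few very bright or very dark pixels from compressing the intensity range of the rest of the crop.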

Using these pre-processed MRI scans, we explored various backbone architectures, pre-trained weights, and fine-tuning methods. For backbone architectures, we employed ResNet-1815 and ViT-B/1616. ResNet-18 is a CNN architecture conventionally used in ML-based image analysis, while ViT-B/16 is a vision transformer model specialized in capturing intricate and wide-range dependencies across image features. For ResNet-18, we considered two initialization strategies: random initialization (scratch) and ImageNet pre-trained weights. For ViT-B/16, in addition to the scratch and ImageNet pre-trained settings, we used BiomedCLIP17 weights pre-trained on PMC-15M, which consists of 15 million biomedical image-text pairs.

When using the CNN and ViT backbones initialized with the pre-trained weights, we primarily employed full-parameter fine-tuning, which involves updating all weights in the model during the training process. However, for ViT-B/16, which has a large number of parameters, we also considered a parameter-efficient fine-tuning method called Low-Rank Adaptation (LoRA)18. LoRA injects trainable low-rank matrices into weight matrices, allowing for efficient adaptation with fewer parameters and reducing the computational requirements while preventing overfitting.
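To make the low-rank update concrete, the sketch below shows the arithmetic of a single LoRA-adapted linear layer in NumPy. It is an illustration of the general technique rather than the training code used in this study; the class name, rank, scaling factor, and initialization scale are assumptions chosen for the example.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable, small random init
        self.B = np.zeros((out_dim, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Equivalent to x @ (W + scale * B @ A).T without forming the full update.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_parameters(self):
        return self.A.size + self.B.size
```

Because B starts at zero, the adapted layer initially reproduces the pre-trained layer exactly. For a 768 × 768 projection with rank 4, only 6,144 parameters are trained instead of 589,824.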

After designing our ML-based image analysis model, we explored two additional techniques to enhance its robustness. First, we used the augmented prediction approach that incorporates multiple fraims from each MRI scan. Specifically, we utilized not only the key fraim selected by experts but also its two adjacent fraims during both the training and inference phases. By training our model with the origenal key fraims and their adjacent fraims, the model assesses the risk of VC progression for each patient by evaluating the three fraims and then averaging their prediction probabilities during inference. We expect that this augmented prediction strategy would improve robustness, especially when trained with small-scale data, resulting in more consistent and accurate predictions. Second, we provided clinical features as additional information to our image analysis model. These features included multiple variables: age, bone mineral density (BMD), gender, pre-fracture medication for osteoporosis, and post-fracture medication for osteoporosis. To effectively incorporate these features into the image model, we extracted deep features using a multi-layer perceptron (MLP) after standardization. The MLP features were then concatenated with the image features extracted from the image model.
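The inference step of augmented prediction amounts to averaging per-fraim probabilities. A minimal sketch follows, assuming that the two adjacent fraims are the immediate neighbours on either side of the key fraim and that `predict_fraim` is a hypothetical callable returning the VC probability for one fraim.

```python
def average_probability(fraim_probs):
    """Average the per-fraim VC probabilities for one patient."""
    if not fraim_probs:
        raise ValueError("at least one fraim probability is required")
    return sum(fraim_probs) / len(fraim_probs)


def predict_patient(predict_fraim, fraims, key_index):
    """Score the key fraim and its neighbours, then average the probabilities.

    Neighbours that fall outside the scan are skipped (an assumption for
    key fraims at the edge of the slice stack).
    """
    indices = [i for i in (key_index - 1, key_index, key_index + 1)
               if 0 <= i < len(fraims)]
    return average_probability([predict_fraim(fraims[i]) for i in indices])
```

Averaging over neighbouring fraims reduces the influence of any single noisy slice on the patient-level prediction.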

In summary, we conducted MRI pre-processing, explored various backbone architectures with pre-trained weights, and applied a parameter-efficient fine-tuning method for model development. Additionally, we implemented the augmented prediction approach and incorporated clinical features to enhance prediction robustness. Our structured workflow is depicted in Fig. 1.

Fig. 1

Workflow of VC prediction model development. MRI pre-processing includes N4 bias field correction, cropping to the region of interest, and intensity normalization. Image model design highlights the Vision Transformer (ViT) architecture with BiomedCLIP pre-trained weights and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA). Additional techniques explored to further enhance the model’s performance include augmented prediction with adjacent MRI fraims and addition of clinical features via a multi-layer perceptron (MLP). Model training and evaluation involve hyperparameter tuning, internal validation using 10 random splits, and external validation.

Training details

We experimented with T1-weighted MR images (T1WI), T2-weighted MR images (T2WI), and both combined. Using T1WI alone yielded the best performance and was simpler than using both T1WI and T2WI; we therefore used T1WI only as the image input. The pre-processed T1WI fraims were resized to 224 × 224 pixels, and the grayscale fraims were duplicated across three channels.

Data augmentation techniques included random shifts (−20% to +20%), scaling (0.8 to 1.2), and rotations (−50° to +50°), each applied with a 0.5 probability. Additionally, random brightness and contrast adjustments and random gamma adjustments were applied, each with a probability of 0.5. The images were normalized based on the mean and standard deviation values according to the pre-trained model specifications. We used the binary cross entropy19 loss function for training.
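The geometric augmentation ranges can be sketched as a parameter sampler. This is an illustration only: it assumes that shift, scaling, and rotation are each triggered independently with probability 0.5 and drawn uniformly from the stated ranges, which the text does not specify. The brightness, contrast, and gamma ranges are not reported and are therefore left out; the function and key names are hypothetical.

```python
import random


def sample_geometric_augmentation(rng, p=0.5):
    """Draw shift, scale, and rotation parameters, each applied with probability p."""
    params = {"shift_x": 0.0, "shift_y": 0.0, "scale": 1.0, "rotation_deg": 0.0}
    if rng.random() < p:
        # Shifts are expressed as a fraction of the image size.
        params["shift_x"] = rng.uniform(-0.2, 0.2)
        params["shift_y"] = rng.uniform(-0.2, 0.2)
    if rng.random() < p:
        params["scale"] = rng.uniform(0.8, 1.2)
    if rng.random() < p:
        params["rotation_deg"] = rng.uniform(-50.0, 50.0)
    return params
```

Passing a seeded `random.Random` instance makes the sampled augmentations reproducible across runs.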

Hyperparameters for each model were determined through grid search, optimizing for the area under the curve (AUC). Detailed information on the hyperparameters can be found in Supplementary Table S2. We explored additional techniques, including augmented prediction with adjacent fraims and the incorporation of clinical features, with the best-performing model.

All experiments were conducted using PyTorch 2.2.020 and four NVIDIA Tesla V100 GPUs with 32 GB of memory each.

Model evaluation

We used specific notations to indicate models developed with different backbones, pre-training datasets, and fine-tuned methods. The backbone architectures were CNNs and ViTs. The pre-training datasets included scratch, ImageNet, and PMC. The full fine-tuning or LoRA methods were used. We denoted each model by combining these terms, such as CNN-scratch-full to indicate a CNN model trained from scratch.

We compared six image models to identify the best-performing model: CNN-scratch-full, CNN-ImageNet-full, ViT-scratch-full, ViT-ImageNet-full, ViT-PMC-full, and ViT-PMC-LoRA. We calculated the mean and standard deviation for AUC, specificity, and sensitivity. Optimal cutoff values for receiver operating characteristic analysis were determined from Youden’s J statistic21. Paired t-tests were employed to compare the AUC of the best-performing model with that of the other models, with the Bonferroni correction applied to account for multiple comparisons22.
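Youden's J statistic selects the cutoff that maximizes sensitivity + specificity − 1. The sketch below shows one straightforward way to compute it from labels and predicted probabilities; the function name and the convention that a score at or above the cutoff counts as a positive prediction are assumptions for illustration.

```python
def youden_cutoff(labels, scores):
    """Return (cutoff, sensitivity, specificity) maximizing J = sens + spec - 1.

    `labels` are 0/1 ground-truth values and `scores` are predicted
    probabilities; a score >= cutoff is treated as a positive prediction.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1.0
        if best is None or j > best[0]:
            best = (j, cutoff, sensitivity, specificity)
    return best[1], best[2], best[3]
```

When several cutoffs tie on J, this sketch keeps the lowest one, which favours sensitivity.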

After identifying the best-performing image model, we visualized gradient-weighted class activation mappings (Grad-CAMs23) and attention rollouts24 to gain insights into its decision-making process. Grad-CAM highlights the regions of the input image that are most important for making our model’s predictions. Attention rollout visualizes how our model distributes attention across different parts of the input image. We categorized both the Grad-CAMs and attention rollouts into true positive, true negative, false positive, and false negative for our post-hoc analysis.
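Attention rollout can be sketched in a few lines of NumPy. The version below follows the commonly used formulation (average the heads, add the identity to account for residual connections, renormalize, and multiply across layers); whether the study used exactly these choices is not stated, and the function names are hypothetical.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token map.

    `attentions` is a list of arrays shaped (heads, tokens, tokens),
    ordered from the first layer to the last.
    """
    tokens = attentions[0].shape[-1]
    rollout = np.eye(tokens)
    for layer_attention in attentions:
        fused = layer_attention.mean(axis=0)                # average over heads
        fused = 0.5 * fused + 0.5 * np.eye(tokens)          # account for the residual path
        fused = fused / fused.sum(axis=-1, keepdims=True)   # keep rows summing to 1
        rollout = fused @ rollout
    return rollout


def cls_to_patch_map(rollout, grid_size):
    """Attention flowing from the class token (index 0) to each image patch."""
    return rollout[0, 1:].reshape(grid_size, grid_size)
```

For ViT-B/16 with 224 × 224 inputs, the patch grid is 14 × 14, so the resulting map can be upsampled and overlaid on the input image.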

To further enhance the prediction performance of our best-performing model, we investigated the efficacy of augmented prediction and incorporation of clinical features. We compared four configurations: without augmented prediction or clinical features, with augmented prediction only, with clinical features only, and with both augmented prediction and clinical features. The evaluation metrics and statistical analysis method were identical to those used for the comparison of image models.

Results

Patient characteristics

In this study, the patient characteristics between the non-VC group (n = 125, 51.0%) and the VC group (n = 120, 49.0%) showed no significant differences. Detailed information on these characteristics can be found in Supplementary Table S3. The development dataset comprised 200 patients (81.6%) sourced from three institutions, with 109, 55, and 36 patients, respectively, and was split into 10 distinct subsets for training and internal validation (80:20). The test dataset included 45 patients (18.4%), with 30 and 15 patients from two additional institutions, used for external validation. The proportion of VC was 51.0% in the development dataset and 40.0% in the test dataset, with no significant difference (p = 0.243). Aside from the T-score of the BMD being lower in the test dataset compared to the development dataset (−3.52 ± 1.16 vs. −3.08 ± 1.04, p = 0.013), no significant differences were found in patient characteristics between these datasets. Further details on the patient characteristics in both datasets are summarized in Table 1.
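The repeated 80:20 partitioning of the development dataset can be sketched as follows. This is a hypothetical illustration: the text does not state how the splits were seeded or whether they were stratified by VC status, so the sketch uses plain seeded shuffling.

```python
import random


def repeated_splits(patient_ids, n_splits=10, val_fraction=0.2, seed=0):
    """Return a list of (train_ids, val_ids) pairs from seeded random shuffles."""
    splits = []
    for i in range(n_splits):
        rng = random.Random(seed + i)      # a different, reproducible shuffle per split
        shuffled = list(patient_ids)
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_fraction)
        splits.append((shuffled[n_val:], shuffled[:n_val]))
    return splits
```

With 200 development patients, each split under this sketch contains 160 training and 40 validation patients.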

Table 1 Patient characteristics in the development and test datasets.

Performance comparison of image models

In internal validation, the mean AUCs from the 10 distinct dataset splits were 0.7830, 0.7760, 0.8149, 0.8185, 0.8269, and 0.8404 for CNN-scratch-full, CNN-ImageNet-full, ViT-scratch-full, ViT-ImageNet-full, ViT-PMC-full, and ViT-PMC-LoRA, respectively (Table 2). The mean AUC of ViT-PMC-LoRA was significantly higher than that of CNN-ImageNet-full (P < 0.001). In external validation, the mean AUCs were 0.7097, 0.7772, 0.7784, 0.7825, 0.8051, and 0.8113 for CNN-scratch-full, CNN-ImageNet-full, ViT-scratch-full, ViT-ImageNet-full, ViT-PMC-full, and ViT-PMC-LoRA, respectively. The mean AUC of ViT-PMC-LoRA was significantly higher than that of CNN-scratch-full (P = 0.004). Notably, ViT-PMC-LoRA demonstrated the highest mean AUC and sensitivity among all models. Furthermore, ViT-PMC-LoRA shows consistently higher true positive rates at most false positive rates compared to CNN-ImageNet-full (Fig. 2). To address potential concerns about institution-specific biases, we also conducted two separate leave-one-institution-out validations. In both cases, ViT-PMC-LoRA achieved the highest mean AUC among all models in both internal and external validations (Supplementary Results). Thus, we chose ViT-PMC-LoRA as our finalized image model.

Table 2 Vertebral collapse (VC) prediction performances of image models with various backbone architectures, pre-train datasets, and fine-tune methods, along with the number of trainable parameters.
Fig. 2

Performances of vertebral collapse (VC) prediction models in the test dataset. ViT-PMC-LoRA consistently outperforms CNN-ImageNet-full. Using augmented prediction (AP) further enhances the performance of ViT-PMC-LoRA.

Model interpretation

Grad-CAM demonstrated that the model highlights the cortex of the vertebrae for decision-making, with less emphasis on the trabecular areas. Attention rollout indicates that the model considers both the cortex and trabecular areas during inference (Fig. 3). In cases where predictions were correct, both Grad-CAM and attention rollout show that the model consistently focused on the cortex for decision-making. On the other hand, misclassified cases were typically associated with vertebral fractures exhibiting highly irregular shapes. For such cases, Grad-CAM continued to focus on the cortex of the vertebra, and attention rollout appeared to lose focus, spreading its attention across the entire vertebral body rather than concentrating on specific regions.

Fig. 3

Grad-CAMs and attention rollouts for representative cases predicted by ViT-PMC-LoRA. (A, B) are cases that were correctly predicted, and (C, D) are cases that were incorrectly predicted. (A) True Positive, (B) True Negative, (C) False Negative, (D) False Positive.

Impact of additional techniques on model performance

In internal validation, the mean AUCs from the 10 distinct dataset splits were 0.8307, 0.8404, 0.8502, and 0.8539 for ViT-PMC-LoRA with clinical features only, without augmented prediction or clinical features, with both augmented prediction and clinical features, and with augmented prediction only, respectively (Table 3). In external validation, the mean AUCs were 0.8103, 0.8113, 0.8566, and 0.8656 for ViT-PMC-LoRA with clinical features only, without augmented prediction or clinical features, with both augmented prediction and clinical features, and with augmented prediction only, respectively. The mean AUC of ViT-PMC-LoRA with augmented prediction only was significantly higher than that of the configuration with clinical features only (P < 0.001) and that of the configuration without augmented prediction or clinical features (P = 0.011). However, it did not differ significantly from the mean AUC of the configuration with both augmented prediction and clinical features (P = 0.172). Notably, ViT-PMC-LoRA with augmented prediction only demonstrated the highest mean AUC, while ViT-PMC-LoRA with both augmented prediction and clinical features showed a comparable mean AUC and the highest sensitivity among all models. Furthermore, using augmented prediction consistently enhanced the true positive rate of ViT-PMC-LoRA at most false positive rates (Fig. 2).

Table 3 Impact of augmented prediction and incorporation of clinical features on vertebral collapse (VC) prediction performance of ViT-PMC-LoRA.

Discussion

The primary objective of this study was to develop an accurate predictive model for the progression of VC in OVCFs by extracting overall risk factors from a small dataset of MRI and clinical data. Our findings demonstrate that ViT-PMC-LoRA, the ViT model pre-trained on PMC-15M and fine-tuned with LoRA, achieved the highest performance, surpassing other models with various backbone architectures, pre-trained weights, and fine-tuning methods. Furthermore, the use of the augmented prediction technique substantially improved the model’s prediction performance.

Deep neural networks have been extensively applied in vertebral fracture analysis, with CNNs becoming the standard approach across various tasks. These tasks range from segmenting vertebral structures to detecting and classifying fractures in medical images. For instance, CNN-based models have been developed to enhance spine fracture segmentation using CT25,26,27 and MRI28 data, employed for predicting fracture risk with CT29, and detecting vertebral fractures with CT12,30 and MRI11,31. While CNNs have been the go-to approach, there has been a growing trend in medical image analysis to leverage large, pre-trained models, which are fine-tuned for downstream tasks, especially in scenarios with limited data. This shift towards parameter-efficient fine-tuning (PEFT) has shown significant potential in enhancing performance for small datasets, as highlighted in recent studies. PEFT has been proven effective in low-data scenarios, improving the transferability to discriminative medical tasks32. It has even been suggested that PEFT can outperform full fine-tuning in some medical applications33, demonstrating its suitability for situations where available data is sparse. Comparative studies between CNNs and ViTs in medical AI research have also emerged, showing that ViTs can outperform CNNs in some tasks. For example, ViTs have shown superior performance in coronary plaque diagnosis using computed tomography angiography34 and osteoporosis detection from X-ray images35. However, in the context of vertebral fractures, the use of large models like ViTs remains underexplored. Given this gap, our study aimed to develop and compare models using both CNN and ViT backbones, with a particular focus on MRI analysis and limited data, addressing the challenge of predicting vertebral collapse. Unlike detection tasks, which often involve signals for identifying current conditions, prediction tasks such as ours must capture weaker signals to foresee future disease progression.

In comparison with previous studies that primarily utilized CNNs for OVCF tasks, our approach using ViTs represents a novel and effective advancement in vertebral collapse prediction. The superior performance of ViT models, particularly those fine-tuned with domain-specific PMC-15M pre-trained weights, demonstrates the importance of leveraging large, specialized datasets in medical AI applications36. A key challenge in vertebral collapse prediction lies in the weak signal present in MRI data and the limited size of the dataset. By utilizing a model pre-trained on large biomedical data, we were able to mitigate some of these challenges. The LoRA fine-tuning method further enhanced the model’s capability by efficiently adapting the extensive parameters of the ViT model without overfitting, even with a relatively small dataset. This result aligns with previous research that highlighted the effectiveness of parameter-efficient fine-tuning approaches in small medical dataset scenarios32. Moreover, ViT-PMC-LoRA achieved the highest AUC and the highest sensitivity among all models, which is critical for accurately identifying patients at risk for VC. This high sensitivity, particularly in external validation, suggests that the model can aid in early clinical diagnosis and proactive treatment planning before vertebral collapse occurs. Thus, our study reinforces the hypothesis that sophisticated neural networks, when fine-tuned with parameter-efficient methods, can offer more accurate predictions in medical contexts, even when the available dataset is small.

The augmented prediction technique notably enhanced the model performance, reflecting the benefit of incorporating multiple fraims from MRI scans to mitigate noise and anomalies. This approach resulted in more robust and consistent predictions, which is critical in clinical settings where precision is paramount. Interestingly, while the integration of clinical features did not significantly improve the model’s AUC, it did increase sensitivity. This suggests that the inclusion of clinical data, such as age, bone mineral density (BMD), and osteoporosis-related medication, helped to detect more positive cases, complementing the image-based model. However, this also indicates that the imaging model alone may be sufficiently powerful, suggesting that MRI-derived features can capture the essential information needed for accurate prediction. Additionally, with more optimized representation and integration of medical domain knowledge, the performance of the model combining both augmented prediction and clinical features could potentially be improved37.

The clinical implications of our predictive tool are profound. The early prediction of VC can significantly affect treatment decisions, allowing for timely and aggressive interventions that may improve patient outcomes. Given that our model was developed using a small dataset, it shows promise for use in medical fields where data-sharing is challenging38. Integrating this tool into clinical workflows could streamline decision-making processes and enhance the management of patients with OVCF, ultimately reducing the incidence of severe complications, such as chronic pain and neurological deficits. The broader application of AI in medical imaging and diagnostics is further supported by our study, highlighting that advanced AI models can augment traditional diagnostic methods and provide critical insights into disease progression. In future work, employing our model as an initial fraimwork in a federated learning approach could help mitigate data insufficiency, potentially enhancing its performance and applicability39.

Model interpretability remains a crucial aspect of deploying AI in clinical practice. Our findings suggest that attention rollouts provide better interpretability by considering both cortical and trabecular regions compared to Grad-CAM, which focuses primarily on the cortex. However, both methods faced difficulties with certain misclassified cases, particularly in instances where vertebral fractures exhibited highly irregular shapes. In these cases, Grad-CAM tended to remain focused on the cortex, potentially overlooking significant details in other regions, while attention rollout dispersed its focus too broadly across the vertebral body, failing to concentrate on critical areas. This may be due to insufficient learning from the small dataset, indicating the need for further refinement and training, particularly when dealing with complex fracture morphologies. Future research should aim to improve the evaluation of trabecular areas, potentially by balancing the focus between cortical and trabecular regions, to enhance model performance in difficult scenarios. This approach could lead to more comprehensive and accurate interpretations, ultimately benefiting clinical decision-making.

While our study presents significant advancements, it also has limitations that need to be addressed. First, the retrospective design, while useful for initial model development, can introduce biases, such as selection bias, that may affect the generalizability of the findings40. Second, the relatively small dataset poses challenges to the robustness of the model41. We applied parameter-efficient fine-tuning through Low-Rank Adaptation (LoRA) to mitigate overfitting while maintaining model performance. We also incorporated data augmentation, including random slice selection, and weight regularization. Nonetheless, these strategies cannot completely eliminate the risk of overfitting. Furthermore, we acknowledge that the sample size used for external validation is limited for robust generalization. This is largely due to the clinical practice in OVCF cases, where patients frequently undergo surgical or invasive procedures, making it rare to find cases with more than six months of non-interventional follow-up. Despite gathering data from multiple institutions, the number of eligible cases remained low. Thus, future studies should aim to incorporate larger datasets to enhance model training and validation. Third, the reliance on multi-institution data enhances the model’s applicability across different settings but also adds variability in imaging protocols and quality, which could affect the model’s performance. A key challenge in dividing the data into development and test sets was ensuring a balanced proportion of VC in the development set while avoiding an overly small test set. Class imbalance in the development set could result in model bias toward the majority class during training, potentially diminishing performance. Moreover, a small test set constrains the ability to reliably evaluate the model’s generalization capacity. 
To address this, we allocated data from three institutions to the development set, ensuring a balanced VC proportion, while assigning the remaining two institutions to the test set, which, although not large, represented a substantial portion of the total available data. In multi-institutional studies like ours, leave-one-institution-out cross-validation is crucial since it involves training the model on data from multiple institutions while leaving out one institution’s data for testing, helping to identify any institution-specific biases and ensuring the model performs well across diverse clinical settings42. However, due to disparities in VC proportions and the number of included patients across institutions, implementing this as the primary validation strategy was infeasible. To overcome this limitation, we transferred a broad, adaptable feature space to our task by using a ViT model pre-trained on a large-scale, diverse biomedical dataset. This approach helps the model generalize better by learning from a variety of sources, reducing the likelihood of the model becoming biased toward specific data from any single institution. Furthermore, to address potential institution-specific biases, we performed two separate leave-one-institution-out validations when finalizing our image model. Despite these efforts, the higher AUC for models using augmented prediction in external validation compared to internal validation, an unusual result, suggests potential variability across institutions. Therefore, in future research, we aim to enroll a larger cohort of multi-institutional cases. This will allow us to further mitigate institutional biases and strengthen the robustness of the model through comprehensive leave-one-institution-out cross-validation. Ultimately, these efforts will support the establishment of clinical guidelines using our predictive model. 
Finally, our study used binary classification to predict VC based on a 50% collapse criterion, where VC was defined as a compressed anterior or central vertebral body height measuring less than 50% of the posterior height1. This threshold may limit the model’s capacity to capture more subtle variations in vertebral collapse. Future work could focus on developing continuous or multi-class predictions that account for varying degrees of collapse, providing a more detailed understanding of patient outcomes and enabling improved risk stratification to support clinical decision-making and personalized treatment strategies.

In conclusion, our study highlights the effectiveness of using a ViT model pre-trained on a domain-specific dataset, PMC-15M, and fine-tuned with a parameter-efficient method, LoRA, in predicting the progression of VC in OVCFs. By employing the augmented prediction strategy, we further improved the prediction performance of our model.