Vertebral Fracture Risk (VFR) Score For Fracture Prediction in Postmenopausal Women
1007/s00198-010-1436-6
ORIGINAL ARTICLE
Vertebral fracture risk (VFR) score for fracture prediction in postmenopausal women
M. Lillholm & A. Ghosh & P. C. Pettersen & M. de Bruijne & E. B. Dam & M. A. Karsdal & C. Christiansen & H. K. Genant & M. Nielsen
Received: 25 January 2010 / Accepted: 2 September 2010 / Published online: 11 November 2010 © International Osteoporosis Foundation and National Osteoporosis Foundation 2010
Abstract
Summary Early prognosis of osteoporosis risk is not only important to individual patients but is also a key factor when screening for osteoporosis drug trial populations. We present an osteoporosis fracture risk score based on vertebral heights. The score separated individuals who sustained fractures (by follow-up after 6.3 years) from healthy controls at baseline.
M. Lillholm (*) · A. Ghosh · E. B. Dam · M. Nielsen
Synarc Imaging Technologies A/S, Herlev, Denmark
e-mail: martin.lillholm@synarc.com
P. C. Pettersen
Center for Clinical and Basic Research, Ballerup, Denmark
M. de Bruijne · M. Nielsen
University of Copenhagen, Copenhagen, Denmark
M. de Bruijne
Biomedical Imaging Group Rotterdam, Erasmus MC-University Medical Center Rotterdam, Rotterdam, The Netherlands
M. A. Karsdal · C. Christiansen
Nordic Bioscience A/S, Herlev, Denmark
H. K. Genant
Department of Radiology, University of California at San Francisco, San Francisco, CA, USA
H. K. Genant
Synarc, San Francisco, CA, USA
Introduction This case-control study was designed to assess the ability of three novel fracture risk scoring methods to predict first incident lumbar vertebral fractures in postmenopausal women matched for classical risk factors such as BMD, BMI, and age.
Methods This was a case-control study of 126 postmenopausal women: 25 who sustained at least one incident lumbar fracture and 101 controls who maintained skeletal integrity over a 6.3-year period. Three methods for fracture risk assessment were developed and tested. They are based on anterior, middle, and posterior vertebral heights measured from vertebrae T12-L5 in lumbar radiographs at baseline. Each score's fracture prediction potential was investigated in two variants using (1) measurements from the single most deformed vertebra or (2) average measurements across vertebrae T12-L5. Emphasis was given to the vertebral fracture risk (VFR) score.
Results All scoring methods demonstrated significant separation of cases from controls at baseline. Specifically, for the VFR score, cases and controls were significantly different (0.67 ± 0.04 vs. 0.35 ± 0.03, p < 10⁻⁶) with an AUC of 0.82. Dividing the VFR scores into tertiles, the fracture odds ratio for the highest versus lowest tertile was 35 (p < 0.001). Sorting the combined case-control group according to VFR score resulted in 90% of cases in the top half.
Conclusion At baseline, the three scores separated cases from controls and, in particular, the VFR score appears to be predictive of fractures. Control experiments, however, also indicate that VFR-based fracture prediction is operator/annotator dependent and that high-quality annotations are needed for good fracture prediction.
Keywords Fracture prediction · Lumbar spine · Postmenopausal women · Vertebra shape · Vertebral morphometry
Introduction
Postmenopausal osteoporosis remains a serious condition affecting millions of individuals worldwide. Current epidemiological evidence suggests that in industrialized countries, approximately 40% of postmenopausal women at the age of 60 and as many as 70% of women at the age of 80 suffer from osteoporosis [1]. Postmenopausal osteoporosis is characterized by a reduction in bone mass due to increased bone resorption and a simultaneous but less pronounced increase in bone formation, resulting in a negative net calcium balance. This ongoing process, fueled by chronic estrogen deficiency, may eventually lead to microarchitectural osteoporosis, possible fractures, and substantial deterioration in the quality of life.
The cardinal feature of osteoporosis is the occurrence of fragility fractures, typically in the spine, but also in the forearm and hips. Whereas limb fractures are easy to diagnose, the case is different for the spine region, where mild vertebral fractures are often asymptomatic. Though the mortality rate from osteoporotic fractures is highest for those of the hip, vertebral fractures are the most common type of fragility fracture, with an estimated occurrence of 750,000 cases per year in the USA [2]. Osteoporotic vertebral fractures typically occur earlier and are an established risk factor for hip fractures [3]. The presence of severe vertebral fractures has been associated with acute and chronic pain, impaired quality of life, increased risk of osteoporotic limb fractures, and shortened life expectancy [4]. There is, therefore, a continuing interest in identifying independent predictors of vertebral fractures that could facilitate the detection of high-risk patients, who would benefit the most from early prevention.
Vertebral fractures are often diagnosed and graded by experienced radiologists using qualitative [5] or semiquantitative methods such as those described by Genant et al. [6] and others [7–11].
The methods were shown to be robust to intra-observer variations but may be difficult to apply uniformly across different clinical centers. More importantly, it is yet to be decided which one of these semiquantitative methods should be used as a gold standard [12]. In order to overcome some of these problems, fully quantitative methods have been developed [13–23]. One shortcoming of most of these methods, which are discrete in nature because they rely on thresholds, is the inability to quantify subtle changes in the vertebral shape. Hence, a more robust and detailed study of vertebral/spine shape abnormalities should produce (1) an objective, quantifiable measure for detection and severity grading of fractures and (2) details of pre-fracture vertebral-shape changes that lead to better prediction of osteoporotic fractures.
The present study investigated whether computer-based measures of fracture risk, calculated using vertebral pre-fracture shape variations, could differentiate healthy subjects who later sustain a vertebral fracture from those who maintain vertebral integrity when matched for an array of traditional risk factors, including bone mineral density (BMD). The rationale behind such an investigation is that the detection of pre-fracture conditions and successful prediction of vertebral fragility fractures will help the study of osteoporosis in the following important ways: (1) early diagnosis and treatment for patients, (2) more precise assessment of the efficacy of fracture prevention drugs through the identification of subjects with a high likelihood of sustaining an osteoporotic fracture, and (3) a decrease in the required sample size for clinical studies through the inclusion of high-risk subjects. In this paper, we pursue parts of the second and third objectives and present quantitative analyses of pre-fracture vertebral-shape changes that yield subject scores indicating first-incidence lumbar vertebral fracture risk. It is demonstrated how this score can be used to rank a screening population and drive the selection of a net study population with increased average fracture risk.
Materials and methods
A case-control study was designed such that case-group patients had no prevalent lumbar vertebral fractures and, by follow-up 6.3 years later, had sustained at least one fracture in the lumbar spine only. The control group maintained skeletal integrity throughout and was matched with respect to an array of traditional risk factors. The study population was chosen from the PERF cohort [24], which consisted of 4,062 community-recruited postmenopausal Danish women first screened between 1992 and 1995 and subsequently reviewed between 2000 and 2001. PERF contained 662 patients with at least one new vertebral fracture at follow-up; 88 of the 662 had no prevalent fractures at baseline and, of those, 25 suffered incident fractures in the lumbar region (T12 to L5) only and were selected as case patients. The case group was matched with a fracture-free control group of 101 subjects, also from the PERF cohort, with comparable risk factors such as age, body mass index (BMI), family history of osteoporosis, alcohol and milk consumption, history of hormone replacement therapy, spine BMD, smoking habits, and self-reported physical exercise. Any patients with non-osteoporotic vertebral deformities or non-osteoporotic fractures were excluded before case and control groups were selected. At baseline, none of the 126 subjects displayed any sign of disorders of calcium metabolism or bone disease, or took any medication known to affect bone metabolism. All subjects were interviewed to obtain information on their medical history, use of medication, and other lifestyle factors. Subjects underwent a complete physical examination; weight and height were determined without shoes and with the subjects
wearing light indoor clothes, and the BMI was calculated. BMD of the spine was determined by bone densitometry using a Lunar Prodigy scanner, and lateral radiographs of the lumbar spine were taken using a standard technique [24]. Written consent was obtained from each participant according to the Helsinki Declaration II. The study was approved by the local ethics committee.
Spine radiograph acquisition and fracture assessment
Spinal examinations were performed according to pre-approved protocols. Radiographs of the lumbar region were taken for each of the subjects at baseline and follow-up. In the lateral position, pillows were used to ensure good alignment of the vertebral bodies. The distance between the focal plane and the film was kept constant at 1.2 m and the central beam was directed to L2. Patients were asked to hold their breath in expiration for the duration of the radiograph acquisition. The same group of staff examined each of the subjects. Anterior-posterior (AP) radiographs were used for a general view and assessment of vertebral deformities (e.g., scoliosis). Obvious vertebral fractures in AP radiographs were noted, but the primary fracture assessment was performed on lateral radiographs. Fracture assessment and classification from the original PERF study [24] was re-evaluated and confirmed using Genant's semi-quantitative method [6] by PP, who was trained in and had several years of experience using the method. Baseline and follow-up radiographs were viewed simultaneously to avoid confusing fracture incidence with undetected or borderline prevalence. The primary utility of the fracture assessment was to establish baseline and follow-up fracture presence. These readings were used to identify the case (n = 36) and control (n = 108) groups in the paper by
Fig. 1 a An example of a lateral lumbar radiograph with six-point annotations in red and vertebra labels in black. The midpoints are always marked on the lower of the endplate contours. b The shape of vertebrae for various values of Hmax, Hmin compared to Genant's height ratio. For illustration purposes, the smallest height Hmin is placed to the left and the largest height Hmax is placed to the right in each vertebra. Each vertebra has the corresponding height ratio noted. The blue area indicates the high-risk patients according to VFR. The green area indicates the high-risk patients according to Genant's height ratio; as depicted, the two measures may indicate different patients as high-risk
Pettersen [25] in which the fracture risk predictive power of a computer-based measure of curvature irregularity was investigated. The case and control groups in this study were specifically selected as subsets of Pettersen's case and control groups as described below.
Digitization of radiographs and six-point annotations
All lateral radiographs were digitized at 45 μm (570 DPI). For further analysis of the images, six points (called the height points) were placed at the corners and at the middle points of the vertebral endplates by the same radiologist using a computer program; see Fig. 1a. Using the heights measured from these points, all vertebrae were evaluated by a computer algorithm using a modification of Genant's methodology with a strict measured threshold of 0.2 as the fracture absence/presence indicator; that is, a vertebra was considered fractured if any of the ratios between the anterior, middle, and posterior heights was 0.8 or less.
Subpopulation and borderline fractures
Pettersen's [25] population was reduced by selecting only subjects for whom the quantitative fracture classification was in agreement with the expert's fracture assessment. This procedure reduced the case and control groups described above to 25 cases and 101 controls, respectively. The additional step deliberately filtered out subjects where there was borderline disagreement between the SQ- and QM-based fracture assessments. This ensured that the fracture prediction results reported in this paper were not influenced by such disagreements.
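The quantitative fracture rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's software (the analysis was carried out in Matlab): the function names are hypothetical, the heights may be in any common unit, and the only values taken from the text are the three heights per vertebra and the 0.8 ratio threshold.

```python
def min_height_ratio(h_ant, h_mid, h_post):
    """Smallest ratio between any two of the three vertebral heights.

    The smallest pairwise ratio (smaller height over larger height)
    is always the minimum height divided by the maximum height.
    """
    heights = (h_ant, h_mid, h_post)
    return min(heights) / max(heights)


def is_fractured(h_ant, h_mid, h_post, threshold=0.8):
    """Flag a vertebra as fractured when a height ratio is 0.8 or less.

    The default threshold follows the text; the function itself is an
    illustrative assumption.
    """
    return min_height_ratio(h_ant, h_mid, h_post) <= threshold


# Hypothetical heights (mm) for two vertebrae, for illustration only.
print(is_fractured(24.0, 22.5, 25.0))  # ratio 0.90
print(is_fractured(25.0, 19.0, 25.0))  # ratio 0.76
```

Because the rule is a hard threshold, two vertebrae with ratios of 0.81 and 0.99 receive the same label, which is the loss of detail that the continuous scores in the following section try to avoid.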
Computer-based prediction of vertebral fractures
Three quantitative scores based on vertebral height morphometry were developed to predict first incident lumbar fractures. The scores were named the vertebral fracture risk (VFR), the most deformed height ratio (MDHR), and the most deformed height anterior/height posterior ratio (MDHaHp). Each of the three scores was tested in two versions: (1) only the most deformed vertebra determined the score and (2) the average over vertebrae T12-L5 determined the score. In the following, we first describe the VFR score on the most deformed vertebra in detail. The remaining two scores and the average versions of all scores use the same overall methodology as the VFR score, and only deviations from it are presented. Computation of the VFR score consisted of two steps: (1) selection of the most deformed vertebra and (2) scoring of this vertebra to form a single score for the patient:
Step 1 Selection of the most deformed vertebra: For a given patient, the ratio of the smallest to the largest of the anterior, middle, and posterior heights was computed for each vertebra. The vertebra among T12 to L5 with the minimal height ratio was denoted the most deformed and chosen to represent the patient. Due to the patient selection described above, all ratios were above 0.8.
Step 2 VFR scoring: Each selected vertebra was represented by the maximal, Hmax, and minimal, Hmin, of its three vertebral heights. The selected heights were normalized by the mean of the vertebral heights in question:
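$$H_{\max} = \frac{\max\{H_{ant}, H_{med}, H_{post}\}}{\operatorname{mean}\{H_{ant}, H_{med}, H_{post}\}}, \qquad H_{\min} = \frac{\min\{H_{ant}, H_{med}, H_{post}\}}{\operatorname{mean}\{H_{ant}, H_{med}, H_{post}\}}$$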
Consequently, Hmax ≥ 1.0 and Hmin ≤ 1.0 for any vertebra. The relationship between Hmax, Hmin, and height ratios as used in, for example, Genant's method is illustrated in Fig. 1b. Each patient in the case and control groups was represented by their normalized height pair. Vertebral height pairs in each group were assumed to be normally distributed, that is, to follow a bivariate normal distribution in (Hmax, Hmin). The empirical mean and covariance matrix were estimated as standard maximum likelihood estimates for each of the case and control groups, and the likelihood P of belonging to each group was expressed as:
$$P_{case}(H_{\max}, H_{\min}) \sim N(\mu_{case}, \Sigma_{case}), \qquad P_{control}(H_{\max}, H_{\min}) \sim N(\mu_{control}, \Sigma_{control})$$
For each patient, the relative likelihood of belonging to the case group was computed from the estimated normal distributions. This relative likelihood ratio was defined to be the VFR score. It is, as constructed, a number between 0 and 1 representing the probability of sustaining a fracture. It should be expected that a patient with a clear prevalent fracture will have a VFR score close to 1; that is, the chance of sustaining a future fracture is trivially very high unless the already fractured vertebra is excluded from the score calculation. The VFR for all patients was computed in a leave-one-out procedure to avoid bias and underestimation of the variance. The explicit modeling of shape variations through bivariate normal distributions, as opposed to single thresholds on, say, height ratios, allowed for a fuller representation of both normal and fracture-prone shape variation.
The remaining two scores were calculated using the same overall methodology as the VFR score with the following exceptions: For MDHR, a vertebra was represented by the minimal height ratio, and the most deformed representative was selected as described above. For MDHaHp, the minimal height ratio was calculated based on the anterior and posterior heights only, but the score was otherwise identical to MDHR. This means that both the MDHR and MDHaHp scores have one-dimensional representations (ratios), and the fitted normal distributions were thus univariate. Furthermore, the explicit height normalization step from the VFR score was not needed due to the implicit normalization through ratios.
All three scores were also tested in mean versions (MVFR, MHR, and MHaHp) where a patient was represented by the mean of height pairs or ratios over vertebrae T12-L5 instead of the single most deformed vertebra as described above. We compared these morphometric prognostic markers, based on analysis of the individual vertebrae, with the irregularity measure based on spine curvature suggested for fracture prediction by Pettersen [25].
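The VFR pipeline described above (most deformed vertebra, normalized height pair, one bivariate normal per group, leave-one-out scoring) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: all function names are hypothetical, the score is assumed to take the form P_case / (P_case + P_control) with equal group priors (the text only says "relative likelihood" between 0 and 1), and the code is pure Python so that it runs without third-party packages.

```python
import math


def normalized_pair(h_ant, h_mid, h_post):
    """Return (Hmax, Hmin): largest and smallest height over the mean height."""
    heights = (h_ant, h_mid, h_post)
    mean_h = sum(heights) / 3.0
    return max(heights) / mean_h, min(heights) / mean_h


def most_deformed(vertebrae):
    """Pick the height triple with the smallest min/max height ratio."""
    return min(vertebrae, key=lambda h: min(h) / max(h))


def fit_gaussian(points):
    """Maximum likelihood mean and covariance (divisor n) of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)


def gaussian_pdf(point, mean, cov):
    """Density of a bivariate normal, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


def vfr_leave_one_out(pairs, is_case):
    """Score each subject with models fitted on all other subjects.

    Assumed score: P_case / (P_case + P_control), equal priors.
    """
    scores = []
    for i, point in enumerate(pairs):
        cases = [p for j, p in enumerate(pairs) if j != i and is_case[j]]
        controls = [p for j, p in enumerate(pairs) if j != i and not is_case[j]]
        p_case = gaussian_pdf(point, *fit_gaussian(cases))
        p_control = gaussian_pdf(point, *fit_gaussian(controls))
        scores.append(p_case / (p_case + p_control))
    return scores


# Hypothetical subject with two annotated vertebrae (heights in mm).
subject = [(24.0, 23.5, 24.5), (25.0, 21.0, 25.5)]
print(normalized_pair(*most_deformed(subject)))
```

Each group needs at least three non-collinear height pairs for the covariance to be invertible; with 25 cases and 101 controls this is not a practical constraint, but it matters for small test sets.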
Any comparisons with Pettersen's method are reported on the reduced dataset presented in this paper for both our method and Pettersen's method.
Method for high-risk clinical study screening
The high-risk population selection mechanism outlined below was aimed at maximizing the number of subjects most likely to sustain first incident lumbar vertebral fractures in a trial population selected from a larger screening population. Population selection is only described for the VFR score but could also be realized using any of the other five proposed scores. The subpopulation selection mechanism consisted of three steps: (1) scoring all subjects from the gross
population using VFR, (2) sorting them in descending VFR order, and (3) selecting the required number of subjects; here we selected the 50% with the highest VFR score. The selection was evaluated on the same study population as the scoring methods outlined above.
Statistical analysis
Results are presented as mean ± SEM unless otherwise specified. The scores of different groups of subjects were compared using the non-parametric Mann-Whitney U test. Differences were considered statistically significant if p values were less than 0.05. The ability to separate cases from controls was further characterized through the area under the ROC curve (AUC). Significance of differences between AUCs was tested with DeLong's method [26]. Odds ratios (ORs) and 95% confidence intervals between highest and lowest tertiles are reported for each method, and the significance of differences between ORs was tested using Tarone's variant of the Breslow-Day test [27]. As a test of the BMD influence on the predictive value of the VFR score, logistic regression with the VFR score and BMD as independent variables was used. The subpopulation selection result is reported as the relative number of cases above the median of the VFR scores across both the case and control datasets.
To assess the inter-annotator stability of the suggested VFR score, the 126 baseline radiographs were six-point annotated (T12-L5) an additional two times by two experienced X-ray technicians. We report mean ± SEM, AUC, and ORs including significance levels for the two repeat annotations where the VFR score was trained on the original annotation. To assess the inter-observer stability of the suggested high-risk population selection methodology, we further report the percentage of cases in the top half of the VFR-ordered dataset for each repeat annotation. We report an overall simulated estimate of expected performance for repeat annotations through an estimate of the annotation scatter observed across the three annotators.
The repeat annotations yielded a total of 126 × 6 × 6 ≈ 4,500 vertebral points annotated by three trained observers. These data were used to estimate the mean and standard deviation of the inter-observer annotation variability for the x and y annotation coordinates separately. The original full dataset was subsequently perturbed with normally distributed variations in the x and y directions according to the estimated means and standard deviations. The main experiment described above was repeated using the perturbed dataset. This random perturbation and subsequent experimentation was repeated 50 times. We report the median AUC and 95% confidence intervals and the median
percentage of cases in the top half over the 50 perturbation trials. The percentage of cases where the most deformed vertebra at baseline is one of the fractured vertebrae at follow-up is reported. This number is supplemented by the percentage of cases where this would be observed by chance. Finally, we report results of the main experiment on the full Pettersen [25] dataset and compare them to the results achieved on the reduced dataset (see "Subpopulation and borderline fractures") used throughout. All data were analyzed using Matlab (Mathworks, USA).
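The ranking-based selection and the AUC used in this section can be illustrated with a short Python sketch. It is a simplified stand-in under stated assumptions: the function names are hypothetical, the AUC is computed through its Mann-Whitney interpretation (the probability that a random case scores higher than a random control, ties counting one half), and the scores in the example are made up.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise case/control comparisons."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def fraction_of_cases_in_top(scores, is_case, fraction=0.5):
    """Share of all cases retained when keeping the highest-scoring subjects.

    Steps: sort subjects by descending score, keep the top `fraction`
    of the population, count how many of the cases were kept.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_keep = int(fraction * len(scores))
    kept = order[:n_keep]
    n_cases = sum(1 for flag in is_case if flag)
    return sum(1 for i in kept if is_case[i]) / n_cases


# Made-up scores for eight subjects, three of them cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
is_case = [True, True, False, True, False, False, False, False]
case_scores = [s for s, c in zip(scores, is_case) if c]
control_scores = [s for s, c in zip(scores, is_case) if not c]
print(auc(case_scores, control_scores))
print(fraction_of_cases_in_top(scores, is_case))
```

The pairwise AUC is quadratic in the number of subjects, which is negligible for 25 cases and 101 controls; a rank-based formula would be preferable for large screening populations.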
Results
Study population
The skeletal and demographic characteristics of the case and control groups are presented in Table 1, where the main statistic reported for each group is the median value; mean ± SD is given in parentheses for completeness. Based on BMD measurements, both the case and control groups contained approximately half normal (non-osteoporotic) and half osteopenic patients at baseline; furthermore, both groups contained two osteoporotic patients at baseline.
Fracture prediction
The three suggested morphometric fracture prediction scores based on the single most deformed vertebra (VFR, MDHR, and MDHaHp) showed significant differences between case and control patients at baseline: VFR (0.67 ± 0.04 vs. 0.35 ± 0.03; p < 10⁻⁶) (Fig. 2), MDHR (0.63 ± 0.04 vs. 0.38 ± 0.02; p < 10⁻⁵), and MDHaHp (0.63 ± 0.04 vs. 0.38 ± 0.02; p < 10⁻⁵), respectively. The mean variants of the three scores also had significant differences between case and control groups at baseline: MVFR (0.57 ± 0.04 vs. 0.41 ± 0.02; p < 10⁻³), MHR (0.58 ± 0.04 vs. 0.38 ± 0.02; p < 10⁻⁵), and MHaHp (0.58 ± 0.04 vs. 0.38 ± 0.02; p < 10⁻⁵). The ROC curves for the three MD scores and Pettersen's irregularity (Irr) score are shown in Fig. 3a. The AUCs were: VFR (0.82), MDHR (0.80), MDHaHp (0.79), and Irr (0.67). Figure 3b additionally shows the ROC curves for the mean variants, which have AUCs: MVFR (0.73), MHR (0.74), and MHaHp (0.76). Based on DeLong's test, the AUC was significantly different between VFR and Pettersen's irregularity measure (p = 0.02), was of borderline significance between the VFR and MDHR (p = 0.093) methods, and did not differ between the VFR and MDHaHp methods (p = 0.22). Similarly, for the mean variants, the AUC was significantly different between the VFR and each of the MVFR (p = 0.02) and MHR (p = 0.03) methods and did not differ between the VFR and MHaHp (p = 0.10) methods.
Osteoporos Int (2011) 22:2119–2128

Table 1 Skeletal and demographic characteristics of the case and control groups

                                         Cases (n = 25)          Controls (n = 101)
                                         Median (mean ± SD)      Median (mean ± SD)
Age (p = 0.98)a                          64.7 (66.9 ± 5.4)       65.4 (66.6 ± 5.9)
Years since menopause (p = 0.56)         16.8 (19.1 ± 8.2)       16.6 (17.7 ± 8.0)
Height (cm; p = 0.21)                    162 (163 ± 4.6)         161 (161 ± 6.0)
Weight (kg; p = 0.53)                    64.5 (68.3 ± 11.7)      65.4 (65.4 ± 8.4)
BMI (kg/m2; p = 0.72)                    24.8 (25.5 ± 4.0)       24.6 (25.0 ± 3.2)
Baseline spine BMD (g/cm2; p = 0.2)      0.79 (0.81 ± 0.14)      0.84 (0.86 ± 0.14)

Distribution of fractures at follow-up   Fracture count
T12                                      4
L1                                       15
L2                                       5
L3                                       2
L4                                       4
L5                                       1

Number of fractures at follow-up         Number of patients
1                                        19
2                                        6
3                                        0

a There is no significant difference between the matched case and control groups, as demonstrated by a Mann-Whitney U test
Neither MDHR nor MDHaHp was significantly better than any of the mean variants. The highest versus lowest tertile ORs with 95% confidence intervals for the three suggested methods (both most deformed and mean variants), Pettersen's irregularity measure, and BMD are given in Fig. 4. The VFR OR was
Fig. 2 VFR scores for the case and control groups at baseline and follow-up. VFR is significantly higher at baseline for the subjects who were going to sustain fractures by follow-up. Furthermore, there are significant differences between baseline and follow-up for both the case and the control group. Significance levels indicated: *p < 0.05; ***p < 0.001
significantly more predictive than Irregularity or BMD alone (p = 0.03 and p = 0.004). Additional results are given for the VFR score only: Figure 5 is a box and whisker plot of the VFR scores for the case and control groups at baseline. There was a significant difference between baseline and follow-up VFR scores for both the case and the control groups: case (0.67 ± 0.04 vs. 0.99 ± 0.01; p < 10⁻⁸) and control (0.35 ± 0.03 vs. 0.44 ± 0.03; p = 0.04), respectively. Finally, the difference between case and control VFR scores at follow-up was significant (0.99 ± 0.01 vs. 0.44 ± 0.03; p < 10⁻¹³). See Fig. 2 for a summary of VFR scores. Logistic regression with BMD and the VFR score as independent variables confirmed the BMD-independent (p < 0.001) nature of the VFR score. Using the VFR score to select a high-risk subpopulation from the combined case-control population yielded 90% of cases in the top half of the total population.
The case vs. control VFR scores based on the most deformed vertebra for the two repeat annotations (using the original annotation as training data) were: repeat (1) 0.63 ± 0.06 vs. 0.33 ± 0.03; p = 0.048 and repeat (2) 0.61 ± 0.06 vs. 0.47 ± 0.03; p = 0.047. The corresponding AUCs and DeLong test p values for difference of AUC from 0.5 were: (1) 0.63, p = 0.026 and (2) 0.63, p = 0.037. Finally, the highest versus lowest tertile ORs were (1) 2.88 (1.0–8.4) and (2) 3.21 (1.2–8.6). The high-risk subpopulation selection resulted in 60% and 64% of cases in the top half of the total population for the two repeat annotations, respectively.
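The highest-versus-lowest tertile odds ratios reported above can be illustrated with the sketch below. It is an assumption-laden example: the function names are hypothetical, the confidence interval is the standard Wald interval on the log odds ratio (the text does not say which interval was used), and the counts and scores in the usage lines are invented rather than taken from the study.

```python
import math


def tertile_counts(scores, is_case):
    """Count cases and controls in the highest and lowest score tertiles.

    Returns (a, b, c, d): cases and controls in the top tertile,
    then cases and controls in the bottom tertile.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = len(scores) // 3
    low, high = order[:k], order[-k:]
    a = sum(1 for i in high if is_case[i])
    b = len(high) - a
    c = sum(1 for i in low if is_case[i])
    d = len(low) - c
    return a, b, c, d


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a Wald confidence interval.

    The interval is exp(log(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d)).
    Empty cells make the estimate undefined, so they are rejected here.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Invented 2x2 counts, for illustration only.
print(odds_ratio_ci(10, 10, 5, 20))
```

With few cases in the lowest tertile, the interval becomes very wide, which is consistent with the broad confidence intervals reported for the larger odds ratios.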
[Fig. 4 plot: odds ratio by predictor (VFR, MVFR, MDHR, MHR, MDHaHp, MHaHp, BMD, Irr); value labels 7.7, 7.0, 11.9, 1.7, 3.2, 6.3, 15.8, 35.7. Fig. 3 plot axes: sensitivity versus 1-specificity.]
Fig. 4 Odds ratios with 95% confidence intervals for the three suggested scores in both the most deformed and mean variants, Pettersen's Irregularity score, and BMD. All odds ratios were calculated between highest and lowest tertile for each predictor
The AUC and OR for the main experiment repeated on the full Pettersen dataset [25] were: AUC, 0.84 (0.77–0.90); p < 10⁻¹⁰ and OR, 52 (11–244); p < 10⁻⁴.
Discussion
The curvatures of the spine are produced by differences in anterior and posterior vertebral heights, by differences in vertebral disc heights and by postural musculoskeletal
Fig. 3 ROC curves. a Contains ROC curves for the most deformed variant of the three suggested scores and Pettersen's Irregularity score [25]. b Compares VFR in its most deformed variant to the mean variants of the three suggested scores (MVFR, MHR, and MHaHp)
Across the 50 datasets perturbed by typical inter-annotator variation, the median AUC with 95% confidence intervals was 0.73 (0.61–0.82) and the median percentage of cases in the top half of the total population was 76%. The most deformed vertebra at baseline was one of the fractured vertebrae at follow-up for 32% of the case subjects. With an average of 1.2 fractures (Table 1) per case subject, this would happen for 21% of cases by chance.
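The chance level quoted above can be reproduced from the Table 1 counts under one simple assumption, made here for illustration because the derivation is not spelled out in the text: the most deformed vertebra is taken to be equally likely to be any of the six vertebrae T12-L5. With 19 case subjects sustaining one fracture and 6 sustaining two, the mean number of fractures per case subject, $\bar{k}$, and the resulting chance probability are:

$$\bar{k} = \frac{19 \cdot 1 + 6 \cdot 2}{25} = \frac{31}{25} = 1.24, \qquad P_{chance} = \frac{\bar{k}}{6} = \frac{31}{150} \approx 0.21$$

The observed 32% corresponds to 8 of the 25 case subjects, compared with roughly 5 expected under this uniform assumption.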
Fig. 5 Box and whisker plots for the case and control groups at baseline. Horizontal box lines are located at the upper quartile, median, and lower quartile. Whiskers extend to the last sample less than 1.5 times the inter-quartile range and outliers are indicated by red dots. The box indentations indicate uncertainties of the median estimate
forces. In the lumbar spine, the convexity is produced partly by differences in growth of the anterior and posterior parts of the vertebrae [28]. This leads to slight differences in the heights, usually confined to an acceptable range. Subsequent changes in vertebral heights can lead to changes in the spinal curvature and to a redistribution of forces upon the vertebral endplates. A vertebra is expected to fracture when the loads imposed are similar to or greater than its strength [29]. From this we expect that an unfractured lumbar vertebra that presents an abnormal change of one or more of the three vertebral heights, due to, for example, osteoporosis, is more likely to fracture or cause fractures than one that stays within the normal range of shape variations.
Inspired by this, three morphometric computer-based scores, each in two variants, were trained to predict lumbar fractures in postmenopausal women through modeled variability of measured anterior, middle, and posterior vertebral heights in lumbar vertebrae. They were applied in a case-control study matched for BMD and the other major risk factors for osteoporosis. All three methods based on the most deformed vertebra were able to significantly differentiate subjects who would sustain at least one lumbar vertebral fracture from those who maintained vertebral integrity over a 6.3-year period. Furthermore, the results suggest that the variants based on the most deformed vertebra produced better fracture predictions than the variants based on means across vertebrae T12-L5. Specifically, the most deformed VFR variant yields significantly better prediction than the MVFR and the MHR and a DeLong p value of 0.1 for MHaHp. Conversely, neither of the other two scores based on the most deformed vertebra was significantly better than VFR or any of the mean variants. This suggests that the VFR score, which is based on two normalized height measurements and thus 2 df, measures more diverse shape variations (Fig.
1b) than any of the suggested 1 df (ratio) scores. Furthermore, morphometry-based lumbar fracture prediction using the single most deformed vertebra is stronger than prediction based on an average or summary across the lumbar spine. Finally, VFR delivers significantly better fracture prediction than the curvature-based irregularity measure suggested by Pettersen [25].
Although L5 was included in this analysis to make a direct comparison with Pettersen's work possible, the suggested methods (including VFR) are directly applicable to datasets where only T12-L4 are annotated in lumbar radiographs. Experiments (not included) on the current dataset indicated a slight but not statistically significant performance drop if only T12-L4 were included. This is supported by the relatively larger morphological and annotation-induced variation of L5, which in the context of this work is most likely too large to consistently add significant information.
The VFR score showed an increased risk of sustaining a second fracture in the case group through a significant difference between cases at baseline and follow-up. This is in accordance with the literature [30, 31] and our expectations in designing the VFR score. There was also a significant difference for control subjects measured at baseline and follow-up, pointing to an overall increase in risk for an incident fracture. Although the control group remained fracture free, this increase in risk is not unexpected over a 6.3-year observation period of postmenopausal women of whom half were osteopenic at baseline.
Our finding on the selection of high-risk subpopulations was that a clear majority of the case patients was found in the top half of the combined case and control group sorted by VFR. This suggests VFR as a viable supplement to BMD and standard risk factors for selecting fracture-free but fracture-prone subjects from a general screening population.
The discriminative performances of the two repeat annotations were not as high as that of the original annotation but were still significant and well within the confidence intervals of simulated performance using estimated annotator scatter. The same was the case for high-risk subpopulation selection performance for both the individual repeat annotations and the simulation. These performance drops are not unexpected as the repeat sets were scored using the original set as training/reference data and not in a leave-one-out fashion and are in this sense less biased. Furthermore, the two repeat sets were annotated by trained X-ray technicians and not radiologists as was the case for the original set. Indirectly, this is reflected by the observed larger SEMs of VFR scores within the case group for the repeat annotators compared with the original annotation.
That fracture prediction of the suggested kind is more sensitive to annotation quality and variation than, say, standard SQ or QM-based fracture classification is, as mentioned, not surprising. The discriminative factors of VFR are subtle pre-fracture shape changes that are smaller than standard implicit or explicit SQ and QM thresholds [6]. Based on this, we emphasize that high-quality, consistent expert annotations are important in achieving good separation. This is especially the case if the methods are applied to less matched populations as in, say, clinical trial screening. Here, one would need to train the algorithm on a similar (same inclusion criteria) but independent reference population prior to screening of novel subjects. In that scenario, we would not expect discriminative performance significantly above the reported median simulated performance. In 32% of the cases, the vertebra that fractured was also the most deformed vertebra, which is somewhat higher than expected by chance alone (21%). This suggests
that a high VFR score indicates an overall lumbar fracture risk in terms of, e.g., an uneven biomechanical load distribution or overall systemic effects indirectly signaled by the most deformed vertebra. Finally, validation of the experiment on the full population of Pettersen et al. [25] confirmed that the population reduction performed to avoid predictive performance based on borderline disagreement between SQ and QM did not lead to improved results, as expected.
Limitations of the study
The subjects used in this study were all community-recruited postmenopausal women, roughly equally split between normal and osteopenic. The case group was pre-selected from a gross population of 4,062 as fracture free at baseline and with at least one lumbar-only fracture at follow-up. The control group was matched with respect to traditional osteoporosis risk factors and maintained skeletal integrity throughout. Any patients with non-osteoporotic deformities and/or fractures were excluded, whether these were real or caused by projection errors. The main results of this paper, although promising, must be validated in future studies to establish applicability to populations with fewer restrictions than described above.
This study focused on fracture risk in the lumbar region (including T12). This was primarily done to facilitate easy comparison with the related results in the paper by Pettersen et al. [25] but also in recognition of the fact that pre-fracture shape changes are more pronounced in the relatively larger lumbar vertebrae. A natural generalization in future studies will be to test whether pre-fracture shape changes in thoracic vertebrae also have fracture-predictive value. Furthermore, thoracic and lumbar vertebrae are very likely to give a better combined prediction of both lumbar and overall fracture risk. It is our expectation that a single linear model as presented in this paper would be insufficient to model the range of combined lumbar and thoracic morphological variance.
We believe, however, that two such models, trained on thoracic and lumbar vertebrae respectively, would in combination be sufficient. For the purposes of the pre-selection and the study in general, fracture absence and presence were established using Genant's semi-quantitative method [6], both by a radiologist and subsequently by a computer algorithm using manually measured vertebral heights with a strict fracture threshold of 0.2. In future studies, it would be relevant to establish how robust the proposed fracture risk scores are with respect to changing the ground truth fracture assessment methodology to other established methods as suggested by, e.g., Eastell, Melton, Minne, and McCloskey [13, 18, 19, 23].
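To make the height-based fracture criterion mentioned above concrete, the sketch below flags a vertebra as fractured when its largest relative height reduction reaches 0.2. This is a simplified illustration under stated assumptions, not the algorithm used in the study: it compares the anterior, middle, and posterior heights within a single vertebra only, treats the threshold as inclusive, and uses hypothetical names and example heights (in millimetres).

```python
from typing import NamedTuple


class VertebralHeights(NamedTuple):
    """Anterior, middle, and posterior heights of one vertebra (same unit)."""
    anterior: float
    middle: float
    posterior: float


def relative_height_loss(h: VertebralHeights) -> float:
    """Largest relative height reduction within one vertebra (0 means none)."""
    if min(h) <= 0:
        raise ValueError("all heights must be positive")
    # Assumption: the tallest of the three heights serves as the reference.
    return 1.0 - min(h) / max(h)


def is_fractured(h: VertebralHeights, threshold: float = 0.2) -> bool:
    """Assumed criterion: fractured if the relative loss reaches the threshold."""
    return relative_height_loss(h) >= threshold


# Hypothetical example heights, not measurements from the study.
deformed = VertebralHeights(anterior=24.0, middle=22.0, posterior=30.0)
intact = VertebralHeights(anterior=27.0, middle=26.0, posterior=30.0)
print(is_fractured(deformed), is_fractured(intact))
```

Established morphometric methods typically also compare heights against adjacent vertebrae or population reference values; that step is omitted here to keep the sketch short.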
Conclusion We have presented three morphometric scores, each in two variants. In a case-control study based on community-recruited postmenopausal women, with the limitations outlined above, all six variants were able to predict first incident lumbar fractures. In general, variants based on measuring only the single most deformed vertebra showed more promise than the mean variants. More specifically, the 2 df VFR score is significantly better than two other mean-based variants but not significantly better than the other two suggested single-vertebra morphometric methods. Subject to the limitations outlined above and the availability of high-quality radiologist six-point annotations, we conclude that relative vertebral heights or height ratios measured on the single most deformed vertebra, used in combination with machine learning techniques, appear a promising approach for (1) first incident lumbar fracture prediction and (2) selection of fracture-prone populations for clinical trials investigating treatment and prevention of osteoporosis and osteoporotic fractures. It is, however, also clear that VFR-based fracture prediction is highly operator/annotator dependent; this and a generalization to both lumbar and thoracic fracture prediction will be the focus of future studies.
Acknowledgements The authors gratefully acknowledge the funding from the Danish Research Foundation (Den Danske Forskningsfond) supporting this work. The authors thank Jane Petersen and Annette Olesen for the repeat annotations.
Conflicts of interest The VFR methodology is part of a pending patent. Martin Lillholm is an employee of Synarc Imaging Technologies/Nordic Bioscience Imaging (SIT/NBI). Anarta Ghosh is a former employee of SIT/NBI. Paola C Pettersen is an employee of Center for Clinical and Basic Research (CCBR). Erik B Dam is an employee of SIT/NBI. Morten A Karsdal is an employee and shareholder of Nordic Bioscience (NB). Claus Christiansen is an employee and shareholder of NB and CCBR. Harry K Genant is an employee and shareholder of Synarc. Mads Nielsen is partly funded by SIT/NBI. Marleen de Bruijne was previously funded by Nordic Bioscience.
References
1. Rodan GA, Martin TJ (2000) Therapeutic approaches to bone diseases. Science 289:1508
2. Watts NB (2001) Osteoporotic vertebral fractures. Neurosurg Focus E12:10
3. Kanis JA, Borgstrom F, De Laet C, Johansson H, Johnell O, Jonsson B, Oden A, Zethraeus N, Pfleger B, Khaltaev N (2005) Assessment of fracture risk. Osteoporos Int 16:581–589
4. Truumees E (2003) Medical consequences of osteoporotic vertebral compression fractures. Instr Course Lect 52:551–558
5. Jiang G, Eastell R, Barrington NA, Ferrar L (2004) Comparison of methods for the visual identification of prevalent vertebral fracture in osteoporosis. Osteoporos Int 17(11):887–896
6. Genant HK, Wu CY, van Kuijk C, Nevitt MC (1993) Vertebral fracture assessment using a semiquantitative technique. J Bone Miner Res 8:1137–1148
7. Ferrar L, Jiang G, Adams J, Eastell R (2005) Identification of vertebral fractures: an update. Osteoporos Int 16:717–728
8. Grados F, Roux C, de Vernejoul MC, Utard G, Sebert JL, Fardellone P (2001) Comparison of four morphometric definitions and a semiquantitative consensus reading for assessing prevalent vertebral fractures. Osteoporos Int 12:716–722
9. O'Neill TW, Felsenberg D, Varlow J (2004) Diagnosis of osteoporotic vertebral fractures: importance of recognition and description by radiologists. Am J Roentgenol 183:949–958
10. Stone J, Gurrin LC, Byrnes GB, Schroen CJ, Treloar SA, Padilla EJ, Dite GS, Southey MC, Hayes VM, Hopper JL (2007) Mammographic density and candidate gene variants: a twins and sisters study. Cancer Epidemiol Biomark Prev 16:1479–1484
11. Nielsen VAH, Pødenphant J, Martens S, Gotfredsen A, Riis BJ (1991) Precision in assessment of osteoporosis from spine radiographs. Eur J Radiol 13:11–14
12. Delmas PD, Langerijt L, Watts NB, Eastell R, Genant H, Grauer A, Cahall DL (2005) Underdiagnosis of vertebral fractures is a worldwide problem: the IMPACT study. J Bone Miner Res 20:557–563
13. Eastell R, Cedel SL, Wahner HW, Riggs BL, Melton LJ (1991) Classification of vertebral fractures. J Bone Miner Res 6:207–215
14. Black DM, Palermo L, Nevitt MC, Genant HK, Epstein R, San Valentin R, Cummings SR (1995) Comparison of methods for defining prevalent vertebral deformities: the Study of Osteoporotic Fractures. J Bone Miner Res 10:890–902
15. Davies KM, Recker RR, Heaney RP (1989) Normal vertebral dimensions and normal variation in serial measurements of vertebrae. J Bone Miner Res 4:341–349
16. Jensen KK, Tougaard L (1981) A simple X-ray method for monitoring progress of osteoporosis. Lancet 2:19–20
17. Kleerekoper M, Parfitt AM, Ellis BI (1984) Measurement of vertebral fracture rates in osteoporosis. Copenhagen Int Symp Osteoporos 1:103–108
18. Melton LJ, Kan SH, Frye MA, Wahner HW, O'Fallon WM, Riggs BL (2005) Epidemiology of vertebral fractures in women. Am J Epidemiol 129:1000–1011
19. Minne HW, Leidig G, Wuster C, Siromachkostov L, Baldauf G, Bickel R, Sauer P, Lojen M, Ziegler R (1988) A newly developed spine deformity index (SDI) to quantitate vertebral crush fractures in patients with osteoporosis. Bone Miner 3:335–349
20. Reshef A, Schwartz A, Ben Menachem Y, Menczel J, Guggenheim K (1971) Radiological osteoporosis: correlation with dietary and biochemical findings. J Am Geriatr Soc 19:391–402
21. Ross PD, Yhee YK, He YF, Davis JW, Kamimoto C, Epstein RS, Wasnich RD (1993) A new method for vertebral fracture diagnosis. J Bone Miner Res 8:167–174
22. Smith-Bindman R, Steiger P, Cummings SR, Genant HK (1991) The index of radiographic area (IRA): a new approach to estimating the severity of vertebral deformity. Bone Miner 15:137–149
23. McCloskey EV, Spector TD, Eyres KS, Fern ED, O'Rourke N, Vasikaran S, Kanis JA (1993) The assessment of vertebral deformity: a method for use in population studies and clinical trials. Osteoporos Int 3(3):138–147
24. Bagger YZ, Tanko LB, Alexandersen P, Hansen HB, Qin G, Christiansen C (2006) The long-term predictive value of bone mineral density measurements for fracture risk is independent of the site of measurement and the age at diagnosis: results from the prospective epidemiological risk factors study. Osteoporos Int 17:471–477
25. Pettersen PC, de Bruijne M, Chen J, He Q, Christiansen C, Tanko LB (2007) A computer-based measure of irregularity in vertebral alignment is a BMD-independent predictor of fracture risk in postmenopausal women. Osteoporos Int 18(11):1525–1530
26. DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44:837–845
27. Tarone RE (1985) On heterogeneity tests based on efficient scores. Biometrika 72(1):91–95
28. Zebaze RMD, Maalouf G, Maalouf N, Seeman E (2004) Loss of regularity in the curvature of the thoracolumbar spine: a measure of structural failure. J Bone Miner Res 19:1099–1104
29. Duan Y, Seeman E, Turner CH (2001) The biomechanical basis of vertebral body fragility in men and women. J Bone Miner Res 16:2276–2283
30. Hasserius R, Karlsson MK, Nilsson BE, Johnell O (2003) Prevalent vertebral deformities predict increased mortality and increased fracture rate in both men and women: a 10-year population-based study of 598 individuals from the Swedish cohort in the European Vertebral Osteoporosis Study. Osteoporos Int 14:61–68
31. Lunt M, O'Neill TW, Felsenberg D, Reeve J, Kanis JA, Cooper C, Silman AJ (2003) Characteristics of a prevalent vertebral deformity predict subsequent vertebral fracture: results from the European Prospective Osteoporosis Study (EPOS). Bone 33:505–513
Copyright of Osteoporosis International is the property of Springer Science & Business Media B.V. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.