Regression Assignment
The relweight variable was calculated by dividing the Final Weight variable in the dataset by its mean.
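The rescaling described above can be sketched in a few lines. This is only an illustration: the function name and the example weights are hypothetical and do not come from the survey data. Dividing each weight by the mean weight yields relative weights that average 1, so the weighted sample size equals the actual sample size.

```python
from statistics import mean


def relative_weights(final_weights):
    """Rescale survey weights so that they average 1 (weight / mean weight)."""
    avg = mean(final_weights)
    return [w / avg for w in final_weights]


# Hypothetical final weights for four respondents (not taken from the dataset).
example_weights = [120.0, 80.0, 200.0, 400.0]
relweight = relative_weights(example_weights)

# The rescaled weights sum to the number of respondents.
print(relweight, sum(relweight))
```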
2. Description of variables
For the regression model, the following five variables were used: Status of Health (Q2),
Number of Cigarettes Smoked Per Day (Q11A), Weight in Kilograms (Q80KG), Number of
Years Smoked (DVYRSMKD) and Sex of Selected Person (SEX) as a control variable. The
properties of the variables are described in detail below:
Status of Health (DV): This variable is measured on an ordinal, 5-point scale, but for the
purposes of the regression model it will be treated as continuous. Participants were asked to
rate their own health, with ‘1’ originally being Excellent, ‘2’ Very Good, ‘3’ Good, ‘4’ Fair
and ‘5’ Poor. The variable was reverse-coded prior to analysis so that a lower value would
indicate a lower quality of health, to aid in interpretation of the final model. The mean of
this variable was 2.26 (SD = 1.002), while the median was 2. The two are fairly close because
the bounded 5-point scale rules out extreme outliers that would pull the mean away from the
median.
Descriptives: Status of Health
                                      Statistic   Std. Error
Mean                                  2.26        0.009
95% CI for Mean, Lower Bound          2.24
95% CI for Mean, Upper Bound          2.28
5% Trimmed Mean                       2.21
Median                                2.00
Variance                              1.004
Std. Deviation                        1.002
Minimum                               1
Maximum                               5
Range                                 4
Interquartile Range                   2
Skewness                              0.487       0.022
Kurtosis                              -0.314      0.045
A bar chart of response percentages shows that the distribution is moderately skewed to the
right (skewness = 0.487), as most participants report at least “Good” health. Strictly
speaking, linear regression assumes normally distributed residuals rather than normally
distributed variables, and it is fairly robust to moderate departures from normality, so this
skew is probably not an issue.
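The reverse-coding of the health item described above amounts to mirroring each response around the midpoint of the scale. A minimal sketch follows, assuming the 1 to 5 coding given in the text; the function name is hypothetical.

```python
def reverse_code(value, low=1, high=5):
    """Mirror a response on a low..high scale, so 1 becomes 5, 2 becomes 4, etc."""
    if not low <= value <= high:
        raise ValueError(f"response {value} is outside the {low}-{high} scale")
    return low + high - value


# Original coding: 1 = Excellent ... 5 = Poor.
# After reverse-coding, higher values indicate better perceived health.
recoded = [reverse_code(v) for v in [1, 2, 3, 4, 5]]
print(recoded)
```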
Number of Cigarettes Smoked Per Day (IV1): This variable is continuous; more
specifically, it is measured at the ratio level, since it has a natural zero point that means no
cigarettes smoked.
Descriptive statistics reveal that the mean number of cigarettes smoked per day is 20.07 (SD =
13.699), while the median is 20. Again, the two are fairly close. The maximum is 98, which
means that at least one respondent reports smoking almost five packs of 20 cigarettes a day.
The minimum is 1 because, in the original dataset, non-smokers were coded as missing on this
variable. Since we have no way of distinguishing participants who left the item empty because
they were non-smokers from those who left it empty for other reasons, we did not modify the
variable, and the analysis therefore includes only smokers. The descriptive statistics also
show a high degree of positive skewness (skewness = 1.713), indicating a long right tail, and
positive kurtosis (kurtosis = 4.702), indicating heavy tails and the presence of outliers, which
are also visible in the histogram.
The histogram reveals that the vast majority of people smoke at most 20-25 cigarettes per day,
and the rest range from 25 to as many as 98 cigarettes per day, with frequencies decreasing
towards the higher end of the scale.
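To make the skewness and kurtosis figures quoted above easier to interpret, the sketch below computes both for a small, made-up sample with one extreme value. It uses the bias-adjusted sample formulas that statistical packages commonly report and, as a simplifying assumption, ignores the survey weights. The data and function names are hypothetical.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Bias-adjusted sample skewness (positive values mean a longer right tail)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    z3 = sum(((x - m) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * z3


def sample_excess_kurtosis(values):
    """Bias-adjusted excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    z4 = sum(((x - m) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * z4 - correction


# Hypothetical cigarettes-per-day values with a single heavy smoker (not survey data).
cigarettes = [5, 10, 10, 15, 20, 20, 20, 25, 30, 98]
print(sample_skewness(cigarettes), sample_excess_kurtosis(cigarettes))
```

A single extreme value is enough to push both statistics well above zero, which mirrors the pattern reported for this variable.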
Weight in Kg (IV2): This variable measures the weight of participants in kilograms rounded
to the closest integer value. The variable is also continuous:
Descriptives: Weight in Kilograms
                                      Statistic   Std. Error
Mean                                  68.66       0.129
95% CI for Mean, Lower Bound          68.40
95% CI for Mean, Upper Bound          68.91
5% Trimmed Mean                       68.07
Median                                67.00
Variance                              195.868
Std. Deviation                        13.995
Minimum                               41
Maximum                               125
Range                                 84
Interquartile Range                   20
Skewness                              0.619       0.023
Kurtosis                              0.517       0.045
Mean weight is 68.66 kilograms (SD = 13.995). The minimum is 41 (since only people above
the age of 15 were allowed to participate in the survey) and the maximum is 125. The
histogram of the variable shows a roughly normal-shaped distribution that is slightly
right-skewed, because values are cut off abruptly at about 40 in the left tail but reach up to
125 in the right tail:
As can be seen, the distribution is fairly symmetric, with both skewness (0.619) and kurtosis
(0.517) below 1.
Years Respondent Smoked (DVYRSMKD): This continuous variable codes the number of
years during which the respondent has smoked, rounded to the nearest integer value. The
mean years smoked is 20.16 (SD = 14.386) indicating that there is quite a lot of variation:
The variable is also positively skewed (skewness = 0.733), but not to an extreme degree. This
skewness arises because the variable is bounded at zero on the left but has no comparably
tight bound on the right. Kurtosis is almost zero (-0.086), showing that the distribution is
neither markedly flat nor markedly peaked. The histogram illustrates this well:
Sex (control variable): Sex was simply coded as a dichotomous, categorical variable, with
Males being coded as ‘1’ and Females coded as ‘2’. After applying the relweight variable in
SPSS, the ratio of males to females in the dataset is almost equal:
Hypotheses:
Null hypothesis: Smoking and weight are both unrelated to perceived quality of health when
controlling for sex.
Alternative hypothesis 1: Smoking will be negatively related to quality of health: the more
cigarettes people smoke, the lower their perceived quality of health will be.
Alternative hypothesis 2: Weight will be negatively related to quality of health: the higher
the weight, the lower the perceived quality of health will be.
Regression results:
The overall F-test (ANOVA) for the regression model was statistically significant (F(4) =
57.987, p < 0.001), which means that at least one of the predictors makes a significant
contribution to predicting Health Status.
The coefficient of determination is, however, very low (R² and adjusted R² are both 0.032),
meaning that the model accounts for only 3.2% of the variance in Health Status, so there is
much room for improving the model. The likely reason is that Health Status is influenced by
many other variables that are not included in the model, or not even measured by the survey.
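That R² and adjusted R² coincide to three decimals is what one would expect with a large sample and only four predictors, since the adjustment penalty shrinks as the sample grows. The sketch below applies the standard adjusted R² formula; the sample size used here is a hypothetical placeholder, because the regression sample size is not stated in this section.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_observations <= n_predictors + 1:
        raise ValueError("need more observations than predictors plus one")
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty


# R^2 and the number of predictors come from the text; n is an assumed value.
ASSUMED_N = 7000  # hypothetical sample size, for illustration only
adj = adjusted_r_squared(0.032, ASSUMED_N, 4)
print(adj)
```

With a few thousand cases the adjusted value differs from R² by well under 0.001, whereas with a few dozen cases the gap would be substantial.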
All three independent variables of interest are statistically significant: Number of
Cigarettes Smoked (b = -0.004, β = -0.054, t = -4.246, p < 0.001), Weight (b = -0.003,
β = -0.048, t = -3.232, p < 0.001) and Number of Years Smoked (b = -0.011, β = -0.161,
t = -12.960, p < 0.001). In practical terms, each additional cigarette smoked per day is
associated with a 0.004-unit decrease in perceived Health Status. A person who smokes 40 more
cigarettes a day than an otherwise comparable person is therefore expected to score 0.16
units lower on Health Status (40 × 0.004), all else being equal. A one-kilogram increase in
weight is associated with a 0.003-unit decrease in Health Status, and an additional year spent
smoking with a 0.011-unit decrease. Comparing the standardized regression coefficients (β),
Number of Years Smoked is the strongest predictor of Health Status among the variables in the
model.
Incidentally, the control variable, Sex, is also significant (b = -0.062, β = -0.031,
t = -2.083, p = 0.037), indicating a 0.062-unit difference in Health Status between males and
females, with females having a lower perceived Health Status.
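The interpretation of the unstandardized coefficients can be reproduced with simple arithmetic. The sketch below uses the b values reported above; the dictionary keys and the function name are hypothetical labels chosen for illustration, and the intercept is omitted because only differences between otherwise comparable people are computed.

```python
# Unstandardized coefficients (b) as reported in the regression results.
COEFFICIENTS = {
    "cigarettes_per_day": -0.004,
    "weight_kg": -0.003,
    "years_smoked": -0.011,
    "sex": -0.062,  # coded 1 = male, 2 = female
}


def predicted_difference(**changes):
    """Expected change in Health Status for the given changes in predictors,
    holding all other predictors constant."""
    return sum(COEFFICIENTS[name] * delta for name, delta in changes.items())


# 40 more cigarettes per day: about 0.16 units lower perceived health.
print(predicted_difference(cigarettes_per_day=40))

# 10 more years of smoking plus 10 kg of extra weight, combined.
print(predicted_difference(years_smoked=10, weight_kg=10))
```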