Assignment 3
Walleyes are the state fish of Minnesota and are the most important
game fish in MN (estimated 43,000 jobs and $2.8 billion in retail
spending). The main contaminant found in MN walleyes is mercury
which can have health consequences if ingested. Every year the MN
Dept. of Natural Resources (DNR) and the MN Dept. of Health (MDH)
publish waterway-specific fish consumption guidelines for at-risk
populations: children under age 15 and women who are or may
become pregnant. To see current consumption guidelines visit:
http://www.health.state.mn.us/divs/eh/fish/eating/sitespecific.html.
b) Take the natural log of HGPPM and use Analyze >
Distribution to examine the distribution of both HGPPM and
log(HGPPM). Find the sample means of the mercury levels in
the log scale ($\overline{\log(y)}$) and in the original scale ($\bar{y}$). Convert
$\overline{\log(y)}$ back to the original scale and compare it to $\bar{y}$. How do
they compare? (3 pts.)
c) Repeat part (b), but this time consider the sample medians
instead. What do you find? (3 pts.)
Med(Y)= 0.625
Med(log(Y)) = -0.470
Converted back to the original scale: e^(-0.470) = 0.625
The median of the original data is equal to the back-transformed median of
the log data.
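The contrast between parts (b) and (c) can be checked on a small made-up sample. The sketch below uses hypothetical values (not the Island Lake data) and an odd sample size, so that the median is a single observation; with an even sample size the two medians can differ slightly because the median averages the two middle values.

```r
# Hypothetical mercury levels in ppm (illustration only, not the assignment data)
y <- c(0.21, 0.38, 0.625, 0.94, 2.10)

# Median: log is an increasing function, so the middle observation stays in the middle
med_y    <- median(y)
med_back <- exp(median(log(y)))

# Mean: back-transforming the mean of the logs gives the geometric mean,
# which is smaller than the arithmetic mean unless all values are equal
mean_y    <- mean(y)
mean_back <- exp(mean(log(y)))

print(c(median = med_y, back_transformed_median = med_back))
print(c(mean = mean_y, back_transformed_mean = mean_back))
```

The two medians agree, while the back-transformed mean falls below the ordinary sample mean.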
Examine residual plots and comment on the model assumptions. (3
pts.)
The main residual plot has a megaphone shape: the spread of the residuals
increases with the predicted value, which shows non-constant variance. The
residuals also do not look normally distributed.
The actual vs. predicted plot shows a linear trend, or linear relationship,
between HGPPM and LGTHIN.
In the normal quantile plot we have a lot of outliers.
The nonconstant variance plot suggests that the spread of the residuals
changes across different values of predicted y, which again shows
non-constant variance: the error variance increases as the predicted value
increases.
f) Despite the fact that this model is clearly deficient, interpret
both parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ in words using proper units. (2
pts.)
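Fitted model: $\widehat{HGPPM} = -0.4108 + 0.06622 \cdot LGTHIN$
$\hat{\beta}_0 = -0.4108$: when the length is 0 inches, the predicted mercury level is
-0.4108 ppm. This is only a baseline value for the fitted line; it has no
physical meaning, because a fish cannot have zero length or a negative
mercury level.
$\hat{\beta}_1 = 0.06622$: when length increases by one inch, the predicted mercury
level increases by 0.06622 ppm.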
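g) Now fit the model:
$E(\log(HGPPM) \mid LGTHIN) = \beta_0 + \beta_1 \, LGTHIN$
$Var(\log(HGPPM) \mid LGTHIN) = \sigma^2$
$\widehat{E}(\log(HGPPM) \mid LGTHIN) = \hat{\beta}_0 + \hat{\beta}_1 \, LGTHIN$
$\widehat{Var}(\log(HGPPM) \mid LGTHIN) = (0.5485)^2 = 0.30$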
Examine residual plots and comment on the model assumptions.
(3 pts.)
The main residual plot looks fine: the residuals are evenly spread across
the x values.
The actual vs. predicted plot shows a linear trend between log(HGPPM) and
LGTHIN.
In the normal quantile plot, the data points fall within the boundary and
close to the line, which indicates a normal distribution.
h) Construct a nonconstant variance plot for the model fit in part
(f), $|\hat{e}_i|^{0.5}$ vs. $\hat{y}_i$, and discuss what this model suggests regarding
the model assumptions. (3 pts.)
This plot suggests that the square root of the absolute residuals does not
change as the predicted value of y increases. The trend line is flat.
This satisfies the constant variance model assumption.
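A nonconstant variance plot of this kind can also be drawn in R. The sketch below is only an illustration on simulated data (hypothetical, not the Island Lake sample) in which the error spread is deliberately made to grow with the predictor, so the trend line rises instead of staying flat.

```r
# Simulated data (hypothetical): the error spread grows with x, so the
# constant-variance assumption is violated on purpose
set.seed(1)
x <- runif(200, 8, 28)
y <- -0.4 + 0.07 * x + rnorm(200, sd = 0.02 * x)

fit <- lm(y ~ x)

# Square root of the absolute residuals against the fitted values
root_abs_res <- sqrt(abs(resid(fit)))
plot(fitted(fit), root_abs_res,
     xlab = "Fitted values", ylab = "sqrt(|residual|)")

# A smoother makes any trend easier to see; a flat line suggests constant variance
lines(lowess(fitted(fit), root_abs_res), col = "red")
```

A roughly horizontal smoother is what the answer above describes for the log-scale model.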
You should find that the model using log (HGPPM ) as the
response is more appropriate for modeling the relationship
between the mercury levels found in the walleyes using their
length in inches. For the remainder of this problem you will be
working with the model from part (c) where the response is
log (HGPPM ).
The p-value is less than 0.05 and the F ratio is large (100.453), so we reject the null
hypothesis in favor of the alternative. Looking at the sums of squares for the model and the
total, the model built using length explains 11.014 of the 17.15 total variability in log(HGPPM).
This shows that the mercury level in the walleyes does depend on the length of the fish.
j) Conduct the following test for the population slope parameter ($\beta_1$)
NH: $\beta_1 = 0$
AH: $\beta_1 \neq 0$
Summarize your results. Square the t-statistic for this test and
compare it to the F-statistic from the test in part (f), what do
you find? (3 pts.)
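In simple linear regression the squared t-statistic for the slope equals the F-statistic of the overall test. The sketch below checks this identity on simulated data (hypothetical values, not the Island Lake sample); the variable names are chosen for illustration only.

```r
# Simulated data (hypothetical, not the Island Lake sample)
set.seed(42)
x <- runif(60, 8, 28)
log_y <- -1 + 0.1 * x + rnorm(60, sd = 0.3)

fit <- lm(log_y ~ x)

# t-statistic for the slope and F-statistic for the overall regression
t_slope <- summary(fit)$coefficients["x", "t value"]
f_stat  <- unname(summary(fit)$fstatistic["value"])

print(c(t_squared = t_slope^2, F = f_stat))
```

The two printed values agree up to floating-point error, which is the relationship the question asks about.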
k) What is the R-Square ($R^2$) value for the regression of log(HGPPM)
on LGTHIN? In the context of this problem, carefully explain
what this value is measuring. (3 pts.)
The R^2 value is 0.642: 64.2% of the variation in log(HGPPM) is explained
by a model using length as a predictor, suggesting a strong predictive
relationship between LGTHIN and log(HGPPM).
In the original scale, this implies that LGTHIN is a good predictor of the
exponential trend in HGPPM.
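As a quick check, R^2 can be recomputed from the two sums of squares quoted in the ANOVA answer earlier (11.014 for the model, 17.15 for the total). This is plain arithmetic, shown as a minimal R sketch.

```r
ss_model <- 11.014  # model sum of squares quoted in the ANOVA answer
ss_total <- 17.15   # total sum of squares quoted in the ANOVA answer

# R^2 is the share of the total variability explained by the model
r_squared <- ss_model / ss_total
print(round(r_squared, 3))
```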
$e^{\hat{\beta}_1} = e^{0.0972} = 1.1021$
Upper confidence limit: $e^{UCI} = e^{0.1167} = 1.1238$
Lower confidence limit: $e^{LCI} = e^{0.0778} = 1.0809$
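The back-transformation above can be reproduced in R. The sketch assumes that 0.0972 is the slope of log(HGPPM) per inch of length and that 0.0778 and 0.1167 are its confidence limits; under that assumption the exponentiated values are multiplicative changes in the mercury level per one-inch increase, which can also be expressed as percent changes.

```r
b1 <- 0.0972                              # estimated slope on the log scale
ci <- c(lower = 0.0778, upper = 0.1167)   # confidence limits on the log scale

mult    <- exp(b1)   # multiplicative change per one-inch increase
mult_ci <- exp(ci)

# The same quantities as percent changes
pct    <- 100 * (mult - 1)
pct_ci <- 100 * (mult_ci - 1)

print(round(c(estimate = mult, mult_ci), 4))
print(round(c(estimate = pct, pct_ci), 1))
```

Read this way, each additional inch of length is associated with roughly a 10% increase in the mercury level, with limits of about 8% and 12%.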
e^(0.6763) = 1.97
For a walleye 17.9 inches in length, the estimated HGPPM is
approximately 1.97 ppm.
We are 95% confident that the true mean of HGPPM for walleyes of 17.9
inches in length lies between 1.854 ppm and 2.097 ppm
o) Give a point estimate and PI for $\widetilde{\log(HGPPM)} \mid LGTHIN = 20$. Also
convert these back to the original scale and interpret. (4 pts.)
e^0.829 = 2.291 ppm
The estimated HGPPM for LGTHIN= 20 is approximately 2.291
ppm.
Note regarding part (l): There are no walleyes in the sample that are 20
inches in length. To obtain these estimates you will need to save the
Prediction Interval Formula to the data table and then add a new row to the
spreadsheet corresponding to a walleye that is 20 inches in length.
It is recommended that humans should not consume more than one
fish per month with mercury levels in its tissues greater than 0.5 ppm.
Because your average walleye angler does not carry a gas
spectrometer in their fishing boat, actually measuring the Hg level
found in a walleye they have caught is a problem. However, it is very
easy for an angler to measure the length of their walleye in inches.
Note regarding parts (m & n): The process of finding an X value associated
with a specific value for the response (Y) is called inverse prediction. Also
keep in mind that your model is for log(Y) not Y, so you will need to take this
into account when answering this question.
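Inverse prediction for a model on the log scale can be sketched as follows. The function name `inverse_predict` and the coefficient values are hypothetical, chosen for illustration only; they are not the fitted values from this assignment. The idea is to solve log(y0) = b0 + b1 * x for x.

```r
# Solve log(y0) = b0 + b1 * x for x (hypothetical helper, not a built-in function)
inverse_predict <- function(y0, b0, b1) {
  (log(y0) - b0) / b1
}

# Hypothetical coefficients for illustration only (not the fitted values)
b0 <- -2.5
b1 <- 0.10

# Length at which the predicted mercury level equals 0.5 ppm
x0 <- inverse_predict(0.5, b0, b1)
print(x0)

# Round trip: plugging x0 back into the model recovers 0.5 ppm
print(exp(b0 + b1 * x0))
```

The key step is taking the log of the target mercury level before solving, because the response in the model is log(Y) rather than Y.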
1 + 0.4108 = 0.06622 * LGTHIN
LGTHIN = 1.4108 / 0.06622
= 21.31 inches
Yes, for a walleye that is 7 inches in length, and no, for a walleye that is 29 inches
in length.
I would not recommend using the model for predictions at either 7 inches or
29 inches because these lengths do not fall within the observed data range
used for developing the model.
t) The Island Lake walleye data also contains the weight (lbs.) for
each of the fish sampled. Do you think using weight as opposed
to length to establish consumption advisories is a good idea?
Justify your answer by fitting models for mercury or log mercury
level using X = WTLB as the predictor and contrasting the
results with those above. (4 pts.)
Using length to establish consumption advisories is a better idea than
using weight. The length-based model has a higher R^2 value, so it explains
more of the variability in mercury levels, and a lower RMSE value than the
weight-based model, so it offers more accurate predictions. Additionally, length
is easier to measure consistently in the field, making it a more practical
choice for developing consumption advisories.
u) Another possible model to consider is:
$E(\log(HGPPM) \mid LGTHIN) = \beta_0 + \beta_1 \log(LGTHIN)$
The main residual plot looks fine: the residuals are evenly spread across
the x values, with constant variance.
The actual vs. predicted plot shows a linear trend between log(HGPPM) and
log(LGTHIN).
In the normal quantile plot, the data points fall within the boundary and
close to the line, which indicates a normal distribution.
This model meets all of the model assumptions (normality, constant variance,
independence and linearity).
v) Use the estimated slope ($\hat{\beta}_1$) from the model in part (t) to
interpret the change in the response in the original scale
associated with a 1 unit increase in log(LGTHIN). (3 pts.)
A 1 unit increase in log(LGTHIN) corresponds to
multiplying LGTHIN by e ≈ 2.718.
That means a 1 unit increase in log(LGTHIN) multiplies the
mercury level by $e^{\hat{\beta}_1} = e^{1.6697} \approx 5.31$.
w) Use the estimated slope ($\hat{\beta}_1$) from the model in part (t) to
interpret the change in the response in the original scale
associated with a 20% increase in the length of walleyes. (3 pts.)
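If length increases by 20%, then HGPPM is multiplied by 1.2^(1.6697) ≈ 1.356, an
increase of about 35.6%. (The shortcut 0.20 × 1.6697 = 33.394% is only a linear
approximation and understates the change.)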
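For a log-log model, a percent change in the predictor multiplies the response by (1 + change)^slope. The sketch below works this out for the slope 1.6697 used above and a 20% increase in length, and compares it with the linear shortcut of multiplying the percent change by the slope.

```r
b1    <- 1.6697   # slope of log(HGPPM) on log(LGTHIN) used above
ratio <- 1.20     # a 20% increase in length

# Exact multiplicative change in the response on the original scale
mult      <- ratio^b1
exact_pct <- 100 * (mult - 1)

# Linear shortcut: percent change in x times the slope
approx_pct <- 100 * 0.20 * b1

print(round(c(exact = exact_pct, shortcut = approx_pct), 1))
```

The exact change is about 35.6%, while the shortcut gives 33.4%; the two are close only for small percent changes in length.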
z) Use R to fit the model $E(\log(HGPPM) \mid LGTHIN) = \beta_0 + \beta_1 \log(LGTHIN)$
and obtain a model summary. Include the output from R below.
To retain the appearance of R output, use Courier New (10 pt)
as the font. (BONUS 5 pts.)
Code to run in R:
> Island = read.csv(file.choose())
> names(Island)
> attach(Island)
> logHg = log(HGPPM)
> logX = log(LGTHIN)
> library(s20x)  # trendscatter() comes from the s20x package
> trendscatter(logHg~logX)
> lm1 = lm(logHg~logX)
> summary(lm1)
> logHGPPM = log(HGPPM)
> hist(logHGPPM)
> library(s20x)
> trendscatter(logHGPPM~LGTHIN)
> logLGTH = log(LGTHIN)  # define the log-length predictor used in the model
> lm1 = lm(logHGPPM~logLGTH)
> summary(lm1)
Call:
lm(formula = logHGPPM ~ logLGTH)

Residuals:
    Min      1Q  Median      3Q     Max
-0.4506 -0.3002  0.0038  0.2322  0.7311

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -5.150      0.440   -11.7  < 2e-16 ***
logLGTH        1.670      0.157    10.6  5.2e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1