Statistical Computing
November 2024
1.(a) Let
$$f(x) = \begin{cases} \dfrac{x}{2}\exp\!\left(-\dfrac{x^{2}}{4}\right) & \text{for } x > 0,\\[4pt] 0 & \text{for } x \le 0. \end{cases}$$
Then, the probability that x > 4 is given by
$$\Pr(x > 4) = \int_{4}^{\infty} f(x)\,dx = \int_{4}^{\infty} \frac{x}{2}\exp\!\left(-\frac{x^{2}}{4}\right) dx.$$
Using the R code shown below, we compute this result to get 0.01831564:
# density f(x) and numerical evaluation of Pr(x > 4)
weibull <- function(x) {
  ifelse(x > 0, (x / 2) * exp(-x^2 / 4), 0)
}
Pr4 <- integrate(weibull, lower = 4, upper = Inf)
Pr4$value
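As a check, the integral also has a closed form: since $\frac{d}{dx}\left[-\exp\!\left(-\frac{x^{2}}{4}\right)\right] = \frac{x}{2}\exp\!\left(-\frac{x^{2}}{4}\right)$,
$$\Pr(x > 4) = \left[-\exp\!\left(-\frac{x^{2}}{4}\right)\right]_{4}^{\infty} = e^{-4} \approx 0.01831564,$$
which agrees with the numerical result.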
1.(b) The Monte Carlo estimator of θ = Pr(x > 4) is
$$\hat{\theta}_f = \frac{1}{n}\sum_{i=1}^{n} I(x_i > 4),$$
where {x1 , x2 , . . . , xn } are independent and identically distributed samples drawn from the proba-
bility density function f (x).
Using a sample size of n = 1000, the Monte Carlo estimate for θ̂f was computed to be approximately
0.019. This value represents the proportion of the distribution f (x) that lies in the region x > 4,
as estimated by the randomly drawn samples.
# Monte Carlo estimate of Pr(X > 4): f is the Weibull(shape = 2, scale = 2) density
MC <- function(n) {
  MCsample <- rweibull(n, shape = 2, scale = 2)
  Finprop <- mean(MCsample > 4)
  Finprop
}
MC(1000)
1.(c) Let g(x) denote the density function of the normal distribution with user-specified values for µ and
σ 2 . Using importance sampling, we can express θ as:
$$\theta = \int_{-\infty}^{\infty} I(x > 4)\,\frac{f(x)}{g(x)}\,g(x)\,dx = \int_{-\infty}^{\infty} \varphi(x)\,g(x)\,dx = E_{g}[\varphi(x)],$$
where
$$\varphi(x) = \begin{cases} \dfrac{f(x)}{g(x)}, & \text{if } x > 4,\\[4pt] 0, & \text{otherwise.} \end{cases}$$
The importance sampling estimator of θ is then
$$\hat{\theta}_g = \frac{1}{n}\sum_{i=1}^{n} \varphi(x_i),$$
where {x1 , x2 , . . . , xn } are samples drawn independently from the normal distribution with density
function g(x). To estimate θ, one needs to generate a sample of size n from g(x) and compute the
mean of ϕ(x) over the sampled values.
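A minimal R sketch of this procedure, assuming user-specified values mu and sigma2 for the mean and variance of g (the function name theta_hat_g and its vectorised form are illustrative, not the exact code used later):

# importance sampling estimate of theta = Pr(X > 4) with a N(mu, sigma2) proposal g
theta_hat_g <- function(n, mu, sigma2) {
  x <- rnorm(n, mean = mu, sd = sqrt(sigma2))               # sample from g
  phi <- ifelse(x > 4,
                dweibull(x, shape = 2, scale = 2) /
                  dnorm(x, mean = mu, sd = sqrt(sigma2)),   # f(x)/g(x) when x > 4
                0)
  mean(phi)
}
# e.g. theta_hat_g(1000, mu = 4.25, sigma2 = 1)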
1.(d) From the properties of the Monte Carlo estimator, we know that θ̂g is unbiased, and its variance
is given by:
$$\mathrm{Var}(\hat{\theta}_g) = \frac{\mathrm{Var}(\varphi(x_1))}{n}.$$
To minimise Var(θ̂g ), the function g(x) should be chosen such that ϕ(x) is nearly constant. In
this scenario, g(x) becomes approximately proportional to the integrand ϕ(x)g(x). In other words,
we aim to select g(x) so that its behaviour closely resembles the density of f (x) when x > 4. By
doing so, g(x) will assign higher probability to regions where ϕ(x)g(x) is large, thus improving the
efficiency of the estimator.
Since the focus is on the upper tail of the Weibull distribution density function, the normal distri-
bution g(x) should be centred around values likely to fall in this region. Consequently, a sequence of
mean values (4.0, 4.25, 4.5, 4.75, 5.0) and variance values (1.0, 1.5, 2.0, 2.5) was considered to identify
the optimal parameters that minimise Var(θ̂g ).
From the R code implementation, it was observed that when the mean is set to 4.25 and the variance to 1.0, the variance Var(θ̂g) is the smallest among the tested combinations. This indicates that these parameters for g(x) provide the most efficient importance sampling for this problem.
# importance sampling estimate of theta and of Var(theta_hat_g), with g = N(mean, variance)
var_thetahat <- function(mean, variance, n) {
  phi <- numeric(n)
  for (i in 1:n) {
    x <- rnorm(1, mean = mean, sd = sqrt(variance))
    if (x > 4) {
      phi[i] <- dweibull(x, 2, 2) / dnorm(x, mean = mean, sd = sqrt(variance))
    } else {
      phi[i] <- 0
    }
  }
  mean_estimate <- mean(phi)
  var_estimate <- var(phi) / n
  c(mean_estimate, var_estimate)
}
means <- c(4.0, 4.25, 4.5, 4.75, 5.0)
variances <- c(1.0, 1.5, 2.0, 2.5)
n <- 1000
results <- data.frame(Mean = numeric(), Variance = numeric(),
Estimated_Mean = numeric(),
Variance_of_Estimate = numeric())
for (mean in means) {
  for (variance in variances) {
    result <- var_thetahat(mean, variance, n)
    results <- rbind(results, data.frame(Mean = mean,
                                         Variance = variance,
                                         Estimated_Mean = result[1],
                                         Variance_of_Estimate = result[2]))
  }
}
best_result <- results[which.min(results$Variance_of_Estimate),
c("Mean", "Variance")]
best_result
1.(e) To compare the two methods for sample sizes n ranging from 1 to 1000, we rewrote the code to generate a plot illustrating the iteration process for both approaches, using the true value as a reference. In this case, g(x) was chosen to be the density function of N(4.25, 1). The plot, shown in Figure 1, provides a visual representation of how the estimators converge as the sample size increases.
From Figure 1, it is evident that the importance sampling method outperforms the Monte Carlo integration when estimating Pr(X > 4). The importance sampling approach converges more quickly to the true value, with significantly lower variance across iterations. This demonstrates the efficiency of importance sampling, particularly when dealing with tail probabilities such as Pr(X > 4), where the event of interest lies in a region with low probability density.
[Plot: "Comparison of Pr(X > 4) Estimation Methods"; x-axis: Iterations, y-axis: Probability Estimate.]
Figure 1: Estimation of Pr(X > 4); importance sampling (green line), Monte Carlo integration (blue line) and actual Pr(X > 4) (purple line).
# running importance sampling estimates of Pr(X > 4), as plotted in Figure 1
# (g is the N(4.25, 1) density; the function name is_running is assumed)
is_running <- function(n) {
  phi <- numeric(n); e <- numeric(n)
  for (i in 1:n) {
    x <- rnorm(1, mean = 4.25, sd = 1)
    phi[i] <- if (x > 4) dweibull(x, 2, 2) / dnorm(x, mean = 4.25, sd = 1) else 0
    e[i] <- mean(phi[1:i])   # running mean of the weights after i draws
  }
  e
}
n <- 1000
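A sketch of how a plot like Figure 1 can then be produced, assuming the running-estimate function above (is_running) and the weibull density from part (a); the Monte Carlo counterpart mc_running is an illustrative analogue, not the exact code used, and the colours follow the Figure 1 caption:

# running Monte Carlo estimates (illustrative counterpart to is_running)
mc_running <- function(n) {
  xs <- rweibull(n, shape = 2, scale = 2)
  cumsum(xs > 4) / seq_len(n)
}
true_val <- integrate(weibull, lower = 4, upper = Inf)$value
plot(1:n, is_running(n), type = "l", col = "green",
     ylim = c(0, 0.04), xlab = "Iterations", ylab = "Probability Estimate",
     main = "Comparison of Pr(X > 4) Estimation Methods")
lines(1:n, mc_running(n), col = "blue")
abline(h = true_val, col = "purple")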
1.(f) We developed a function to perform 500 iterations of the simulation process for both the Monte
Carlo integration and importance sampling methods, with each simulation using a sample size of
1000. This function allows us to evaluate and compare the performance of the two approaches
under repeated sampling conditions.
# 500 repetitions of the Monte Carlo estimate, each based on n draws from f
MCsim <- function(nsim, n) {
  MCsimvalue <- numeric(nsim)
  for (i in 1:nsim) {
    MCsample <- rweibull(n, shape = 2, scale = 2)
    MCsimvalue[i] <- mean(MCsample > 4)
  }
  MCsimvalue
}
MCsim_result<-MCsim(500,1000)
MCsim_result
# 500 repetitions of the importance sampling estimate, with g the N(4.25, 1) density
imsamg_sim <- function(nsim, n) {
  phi_sim <- numeric(nsim)
  for (j in 1:nsim) {
    phi <- numeric(n)
    for (i in 1:n) {
      x <- rnorm(1, mean = 4.25, sd = 1)
      if (x > 4) {
        phi[i] <- (x / 2) * exp(-x^2 / 4) / dnorm(x, mean = 4.25, sd = 1)
      } else {
        phi[i] <- 0
      }
    }
    phi_sim[j] <- mean(phi)   # average the weights once per simulation
  }
  phi_sim
}
imsamg_sim_result<-imsamg_sim(500,1000)
imsamg_sim_result
1.(gi) We plotted two histograms of the estimates from the Monte Carlo integration and importance sam-
pling methods, superimposing the corresponding normal distributions with their respective sample
means and variances. To assess the normality of the samples, we applied the Kolmogorov–Smirnov test.
[Figure 2: histogram of MCsim_result with fitted normal density; x-axis: MCsim_result, y-axis: Density.]
However, for the Monte Carlo integration samples, tied values in the data (multiple occurrences of
the same value) disrupted the smoothness of the empirical CDF, making the Kolmogorov–Smirnov test
less reliable. So, we employed the Lilliefors test, which is an adaptation of the Kolmogorov–Smirnov test
for this scenario.
First, considering the Monte Carlo integration samples, Figure 2 provides an overview of their
distribution, which is asymmetrical and right-skewed. Hypothesis testing was conducted with the
null hypothesis that the samples follow a normal distribution versus the alternative hypothesis that
they do not. Using the Lilliefors test, we obtained a p-value = 5.218 × 10−7 , which is significantly
smaller than 0.05. Thus, we rejected the null hypothesis at the 5% significance level, concluding
that the Monte Carlo integration samples are not normally distributed.
Next, for the importance sampling samples, Figure 3 shows a distribution that visually fits the
theoretical normal distribution. To confirm this, we conducted the Kolmogorov–Smirnov test with
the same hypothesis as above, resulting in a p-value = 0.9648, which is much larger than 0.05.
Therefore, we failed to reject the null hypothesis, concluding that these samples are normally
distributed at the 5% significance level. Additionally, the QQ-plot in Figure 4 further supports this
conclusion by showing a strong agreement with the theoretical quantiles of the normal distribution.
Hence, we confirmed that the importance sampling samples follow a normal distribution with a
mean and variance matching the empirical mean and variance of the data.
MCsim_result<-MCsim(500,1000)
mean_MCsim_result<-mean(MCsim_result)
mean_MCsim_result
variance_MCsim_result<-var(MCsim_result)
variance_MCsim_result
hist(MCsim_result, freq = FALSE, main = "Histogram of MCsim_result",
xlab = "MCsim_result",xlim = c(0,0.035))
curve(dnorm(x, mean = mean_MCsim_result, sd = sqrt(variance_MCsim_result)),
add = TRUE, col = "red", lwd = 2)
library(nortest)
lillie.test(MCsim_result)
imsamg_sim_result<-imsamg_sim(500,1000)
[Figure 3: histogram of imsamg_sim_result with fitted normal density. Figure 4: normal Q-Q plot of imsamg_sim_result; x-axis: Theoretical Quantiles.]
mean_imsamg_sim_result<-mean(imsamg_sim_result)
mean_imsamg_sim_result
variance_imsamg_sim_result<-var(imsamg_sim_result)
variance_imsamg_sim_result
hist(imsamg_sim_result, freq = FALSE, main = "Histogram of imsamg_sim_result",
xlab = "imsamg_sim_result")
curve(dnorm(x, mean = mean_imsamg_sim_result,
sd = sqrt(variance_imsamg_sim_result)),
add = TRUE, col = "red", lwd = 2)
qqnorm(imsamg_sim_result,
main = "Normal Q-Q Plot of imsamg_sim_result (nsim = 500)")
qqline(imsamg_sim_result, col = "red", lwd = 2)
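The Kolmogorov–Smirnov test referred to above is not shown in the listing; a call consistent with the description, testing against a normal distribution with the empirical mean and standard deviation, would be (a sketch):

# KS test of the importance sampling estimates against the fitted normal
ks.test(imsamg_sim_result, "pnorm",
        mean = mean_imsamg_sim_result,
        sd = sqrt(variance_imsamg_sim_result))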
1.(gii) First, let us consider Monte Carlo integration. The bias is calculated as the mean of the estimator samples minus the true value. From the code, we obtain bias = 3.236111 × 10−5 , which is extremely small and consistent with the estimator θ̂f for Monte Carlo integration being unbiased.
Let E(θ̂) = ψ = ψ(θ). The mean squared error (mse) can be expressed as:
$$\mathrm{mse}(\hat{\theta}) = E\Big[\big(\hat{\theta} - \psi + \psi - \theta\big)^{2}\Big] = E\Big[\big(\hat{\theta} - \psi\big)^{2} + 2\big(\hat{\theta} - \psi\big)(\psi - \theta) + (\psi - \theta)^{2}\Big].$$
Since E(θ̂ − ψ) = 0, the cross term vanishes, and hence the mean squared error can be written in terms of bias and variance as:
$$\mathrm{mse}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{bias}(\hat{\theta})^{2}.$$
From the simulation, the variance and mean squared error are Var(θ̂f ) = 1.783457 × 10−5 and
mse(θ̂f ) = 1.783561 × 10−5 , both of which are small. This demonstrates that θ̂f is consistent, as
mse(θ̂f ) → 0 as n → ∞.
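The bias, variance and mean squared error above are not shown in the code listings; a sketch of how they can be computed from the simulated estimates, using the value of θ obtained in part (a), is:

# empirical bias, variance and mse of the Monte Carlo estimates
theta_true <- 0.01831564
bias_MC <- mean(MCsim_result) - theta_true
var_MC  <- var(MCsim_result)
mse_MC  <- mean((MCsim_result - theta_true)^2)
c(bias_MC, var_MC, mse_MC)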
We now compare the bias, variance, and mean squared error from the simulated data with the
exact theoretical values. For the exact mean:
$$E(\hat{\theta}_f) = \frac{1}{n}\sum_{i=1}^{n} E\big(I(x_i > 4)\big) = E\big(I(x > 4)\big) = \Pr(X > 4) = \theta = 0.01831564,$$
showing that bias(θ̂f ) = 0, which means θ̂f is unbiased. The simulation result obtained above agrees with this property.
For the variance:
$$\mathrm{Var}(\hat{\theta}_f) = \frac{1}{n^{2}}\, n \cdot \mathrm{Var}\big(I(x > 4)\big) = \frac{1}{n}\Pr(X > 4)\big(1 - \Pr(X > 4)\big) = 1.798018 \times 10^{-5},$$
when n=1000, which closely matches the variance from the simulated data. Finally, for the mean
squared error:
$$\mathrm{mse}(\hat{\theta}_f) = \mathrm{Var}(\hat{\theta}_f) = \frac{1}{n}\Pr(X > 4)\big(1 - \Pr(X > 4)\big) = 1.798018 \times 10^{-5},$$
closely aligning with the simulated result.
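This theoretical value can also be checked directly in R, using the numerical integral Pr4 computed in part (a):

# theoretical variance (and mse) of theta_hat_f when n = 1000
Pr4$value * (1 - Pr4$value) / 1000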
Next, we focus on importance sampling. From the code, we obtain bias = −5.390686 × 10−5 , which is extremely small and consistent with the estimator θ̂g from importance sampling being unbiased. Furthermore, we observe that the variance and mean squared error are Var(θ̂g ) = 5.272578 × 10−7 and mse(θ̂g ) = 5.301638 × 10−7 , both of which are small. This demonstrates that θ̂g is consistent, as mse(θ̂g ) → 0 as n → ∞.
Hence, we can conclude that both estimators are very