Exercise 2 Submission Group 12 Yalcin Mehmet
University of Oldenburg
Summer 2023
Instructors: Maria Fernanda "MaFe" Davila Restrepo, Wolfram "Wolle" Wingerath
1.) Suppose that 80% of people like peanut butter, 89% like jelly, and 78% like both. Given that a randomly
sampled person likes peanut butter, what is the probability that she also likes jelly?
Solution: By the definition of conditional probability, P(jelly | peanut butter) = P(jelly and peanut butter) / P(peanut butter) = 0.78 / 0.80 = 0.975.
print("The probability of liking jelly given that the person likes peanut butter is:",
      0.78 / 0.80)
The probability of liking jelly given that the person likes peanut butter is: 0.975
By the way, you can also add markdown cells where you can use LaTeX, e.g. in equations:
2x + y = 5
x − y = 1
Solution:
We do not have any information about the relationship between A and B; in other words, we do not
know whether A and B are independent. Only if A and B are independent events can we compute
P(A and B), using the formula P(A ∩ B) = P(A) ⋅ P(B).
(b) Assuming that events A and B arise from independent random processes:
(1) What is P(A and B)?
If events A and B are independent, then the probability of both events occurring, denoted by
P (A ∩ B) , can be computed using the multiplication rule of probability:
P (A ∩ B) = P (A) ⋅ P (B)
Solution:
P_A = 0.3
P_B = 0.7
# Compute the probability of both events occurring using the multiplication rule
P_A_and_B = P_A * P_B
(2) Again assuming that A and B are independent, what is P(A or B)?
Solution: If events A and B are independent, then the probability of either A or B occurring, denoted
by P (A ∪ B), can be computed using the addition rule of probability:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
Since A and B are independent, we have P (A ∩ B) = P (A) ⋅ P (B) , so we can substitute this
expression into the equation above to get:
# Compute P(A or B) using the equation P(A or B) = P(A) + P(B) - P(A and B)
P_A_or_B = P_A + P_B - (P_A * P_B)
print("P(A or B) =", round(P_A_or_B, 2))
P(A or B) = 0.79
Solution:
P(A|B) = P(A ∩ B) / P(B)
Since A and B are independent, P(A ∩ B) = P(A) ⋅ P(B), so P(A|B) = P(A) = 0.3.
In [42]: # Assuming A and B are independent, P(A|B) = P(A)
P_A = 0.3
P_B = 0.7
P_A_given_B = P_A
print("P(A|B) =", P_A_given_B)
P(A|B) = 0.3
3.) Consider a game where your score is the maximum value from two dice throws. Write a small python
function that outputs the probability of each event from {1, ..., 6}.
Solution: The score equals m exactly when at least one throw shows m and neither throw exceeds m, which happens in 2m − 1 of the 36 equally likely outcomes, so P(score = m) = (2m − 1)/36.
In [43]: # Probability of each possible score in the game where the score is
# the maximum value from two dice throws
def dice_probability():
    dice = [1, 2, 3, 4, 5, 6]
    prob = [0] * 6
    # enumerate all 36 equally likely outcomes of two throws
    for i in range(6):
        for j in range(6):
            score = max(dice[i], dice[j])
            prob[score - 1] += 1
    # turn counts into probabilities
    for i in range(6):
        prob[i] /= 36
    return prob

print(dice_probability())
4.) If two binary random variables X and Y are independent, is the complement of X and Y also
independent? Give a proof or a counterexample.
Solution: Yes. If X and Y are independent, their complements are independent as well:
P(Xᶜ ∩ Yᶜ) = 1 − P(X ∪ Y) = 1 − P(X) − P(Y) + P(X) ⋅ P(Y) = (1 − P(X)) ⋅ (1 − P(Y)) = P(Xᶜ) ⋅ P(Yᶜ).
The output below is a numerical check on the joint distribution Z = (X, Y):
P(Z=(0,0))P(Z=(1,1)) = 0.0625
P(Z=(0,1))P(Z=(1,0)) = 0.0625
P(Z) = [0.25 0.25 0.25 0.25]
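The code cell that produced this output is not reproduced above; a minimal sketch of such a check, assuming two independent fair bits (P(X=1) = P(Y=1) = 0.5), could look like this:

import numpy as np

# Marginal distributions of two independent fair bits (assumption: p = 0.5)
p_x = np.array([0.5, 0.5])
p_y = np.array([0.5, 0.5])

# Joint distribution of Z = (X, Y); independence means it is the outer product
p_z = np.outer(p_x, p_y)

# For a 2x2 table, independence holds exactly when the joint probabilities
# factorize, i.e. the two "diagonal" products are equal
print("P(Z=(0,0))P(Z=(1,1)) =", p_z[0, 0] * p_z[1, 1])
print("P(Z=(0,1))P(Z=(1,0)) =", p_z[0, 1] * p_z[1, 0])
print("P(Z) =", p_z.flatten())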
(a) Decide which one has the greater mean and the greater standard deviation without computing either
of those.
Solution:
(b) Compute mean and standard deviation of these distributions using built-in functions of the pandas
library.
Solution:
import pandas as pd

# dist1 and dist2 are the two distributions from part (a)
mean1, mean2 = pd.Series(dist1).mean(), pd.Series(dist2).mean()
std1, std2 = pd.Series(dist1).std(), pd.Series(dist2).std()

# print results
print("Mean of dist1:", mean1)
print("Mean of dist2:", mean2)
print("Standard deviation of dist1:", std1)
print("Standard deviation of dist2:", std2)
(c) Now define a function yourself to compute mean and standard deviation. Verify that the functions work
correctly by comparing the output with the output of the built-in functions.
Solution:
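The code cell for this part is not preserved above; a minimal sketch, with hypothetical placeholder data since the original dist1/dist2 values are not shown, could be:

import pandas as pd

def my_mean(values):
    return sum(values) / len(values)

def my_std(values):
    # sample standard deviation with Bessel's correction (n - 1),
    # which is what pandas' .std() uses by default
    m = my_mean(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5

dist1 = [2, 4, 4, 4, 5, 5, 7, 9]  # placeholder data, not the original distribution
print("own mean:", my_mean(dist1), " pandas mean:", pd.Series(dist1).mean())
print("own std: ", my_std(dist1), " pandas std: ", pd.Series(dist1).std())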
Solution:
import numpy as np

dist1 = [1, 1, 1, 1, 1, 1, 1, 1, 1]
arithmetic_mean = np.mean(dist1)
geometric_mean = np.power(np.prod(dist1), 1/len(dist1))
print("Arithmetic mean:", arithmetic_mean, "Geometric mean:", geometric_mean)
3.) How do the arithmetic and geometric mean compare on random integers that are not identical? Draw a
sample distribution of size 10 a few times and write down your findings.
Note:
Please implement your own version of the geometric mean and use it in this task.
Solution: As we can see from the output, both means vary from sample to sample. In general, however,
the geometric mean is never larger than the arithmetic mean (the AM-GM inequality), and the gap widens
when the sample contains extreme values: a single small value pulls the geometric mean down strongly,
because it enters the product multiplicatively, while the arithmetic mean weights every value equally.
import math

def arithmetic_mean(distribution):
    return sum(distribution) / len(distribution)

def geometric_mean(distribution):
    return math.prod(distribution) ** (1 / len(distribution))
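The sampling code itself is not shown above; using the two functions just defined, drawing a few samples of size 10 might look like this (the value range 1 to 100 is an assumption):

import random

for _ in range(5):
    # draw 10 random integers and compare both means
    sample = [random.randint(1, 100) for _ in range(10)]
    print(sample, "AM =", round(arithmetic_mean(sample), 2),
          "GM =", round(geometric_mean(sample), 2))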
1.) A correlation coefficient of −0.9 indicates a stronger linear relationship than a correlation coefficient of
0.5 – true or false? Explain why.
Solution: True. The correlation coefficient measures the strength and direction of the linear
relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfectly negative
linear relationship, 0 indicates no linear relationship, and 1 indicates a perfectly positive linear
relationship. The strength of the relationship is given by the absolute value of the coefficient; the
sign only tells us its direction. A correlation coefficient of -0.9 therefore indicates a strong negative
linear relationship (as one variable increases, the other tends to decrease), while a coefficient of 0.5
indicates only a moderate positive linear relationship. Since |-0.9| = 0.9 > 0.5 = |0.5|, the statement
is true: a correlation coefficient of -0.9 indicates a stronger linear relationship than a correlation
coefficient of 0.5.
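The code cell referenced in this solution did not survive the export; a sketch of such a check, assuming simulated bivariate normal data with population correlations of roughly -0.9 and 0.5, could be:

import numpy as np

rng = np.random.default_rng(0)
# Simulate two datasets with population correlations of about -0.9 and 0.5
neg = rng.multivariate_normal([0, 0], [[1, -0.9], [-0.9, 1]], size=10000)
pos = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=10000)

r_neg = np.corrcoef(neg[:, 0], neg[:, 1])[0, 1]
r_pos = np.corrcoef(pos[:, 0], pos[:, 1])[0, 1]

# The strength of the linear relationship is the absolute value of r
print("r1 =", round(r_neg, 2), "|r1| =", round(abs(r_neg), 2))
print("r2 =", round(r_pos, 2), "|r2| =", round(abs(r_pos), 2))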
2.) Compute the Pearson and Spearman Rank correlations for uniformly drawn samples of points (x, x^k)
and answer the questions below.
Note:
You can use scipy.stats.spearmanr and scipy.stats.pearsonr to compute the ranks.
Solution: This code generates a sample of n points by randomly selecting x values and a single
exponent k. It then calculates y = x^k. Finally, it computes both the Pearson and Spearman rank
correlations between x and y. The results are printed to the console. You can run this code multiple
times to observe the different correlations obtained for different random samples of x and k.
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Sample size
n = 100
# Draw x uniformly and pick a random exponent k (the ranges are assumptions)
x = np.random.uniform(0, 1, n)
k = np.random.randint(1, 10)
# Calculate y = x^k
y = np.power(x, k)
pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)
# Print results
print(f"Pearson correlation: {pearson_corr}")
print(f"Spearman rank correlation: {spearman_corr}")
Solution: For positive x, y = x^k is a strictly monotonic function of x, so the Spearman rank correlation
stays at 1 for every positive k (and at -1 for negative k): the ranking of y is always exactly the ranking
of x. The Pearson correlation, in contrast, only measures linear association. For k = 1 it is exactly 1,
and as k grows the curve x^k becomes increasingly nonlinear, so the Pearson coefficient drops further
and further below 1 even though the relationship stays perfectly monotonic. In short, increasing k
weakens the linear (Pearson) correlation while leaving the rank-based (Spearman) correlation unchanged.
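A short check of this behaviour (positive x only; the range and the k values are chosen arbitrarily):

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.random.uniform(0, 1, 1000)
for k in [1, 2, 5, 10, 20]:
    y = x ** k
    # Spearman stays at 1 for every k, Pearson drops as the curve bends
    print("k =", k,
          "Pearson:", round(pearsonr(x, y)[0], 3),
          "Spearman:", round(spearmanr(x, y)[0], 3))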
(b) Does it matter whether x can be positive and negative or positive only?
Solution: Yes, it matters. If x is restricted to positive values, y = x^k is monotonic in x, so the
Spearman correlation is exactly ±1 and the Pearson correlation remains high. If x can also take negative
values and k is even, y = x^k is no longer monotonic (it is symmetric in x), so both the Pearson and the
Spearman coefficient collapse towards 0 even though y is still completely determined by x. For odd k the
function remains monotonic on the whole real line, and the behaviour is similar to the positive-only case.
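A quick check of the positive/negative case (even exponent; drawing x from (-1, 1) is an illustrative assumption):

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.random.uniform(-1, 1, 1000)
y = x ** 4
# y = x^4 is not monotonic in x, so both coefficients end up close to 0
print("Pearson:", round(pearsonr(x, y)[0], 3),
      "Spearman:", round(spearmanr(x, y)[0], 3))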
(∏_{i=1}^{n} a_i)^{1/n} = ⁿ√(a_1 ⋅ a_2 ⋯ a_n)
For large sample sets or small sample sets with very large numbers, the computed product will become
very large. Chances are that your implementation of the geometric mean does not work with large sample
sizes, either – go ahead and try it with a sample size of, let's say, 1000!
Assuming that your implementation does not work for a sample size of 1000, create one that does!
Solution:
One way to handle large sample sizes when computing the geometric mean is to use the logarithmic
form of the formula. The logarithmic form of the geometric mean is:
(∏_{i=1}^{n} a_i)^{1/n} = exp((1/n) ⋅ ∑_{i=1}^{n} ln(a_i))

Using this formula, we can compute the geometric mean for large sample sizes by computing the sum of
the logarithms of the data points and then taking the exponential of that sum divided by the sample size.
In this implementation, we first generate a list of 1000 random data points using the random module. We
then calculate the sum of the logarithms of all the data points using the built-in math.log() function. This
avoids the issue of very large numbers that can arise when calculating the product of all the data points.
We then calculate the geometric mean using the logarithmic sum and the number of data points, and
print the result using the built-in math.exp() function to exponentiate the logarithmic sum.
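The corresponding code cell is not reproduced above; a minimal sketch following this description (the value range is an arbitrary assumption) could look like this:

import math
import random

# Generate 1000 random data points (positive values, range chosen arbitrarily)
data = [random.uniform(1, 1000) for _ in range(1000)]

# Sum the logarithms instead of multiplying the raw values,
# which avoids the huge intermediate product
log_sum = sum(math.log(a) for a in data)

# Geometric mean = exponential of the average logarithm
geometric_mean = math.exp(log_sum / len(data))
print("Geometric mean of 1000 samples:", geometric_mean)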
Finally: Submission
Save your notebook and submit it (as both notebook and PDF file). And please don't forget to ...
... choose a file name according to convention (see Exercise Sheet 1) and to
... include the execution output in your submission!