Exercise 2 Submission: Group 12, Mehmet Yalcin


Data Science I

Overall Points:       / 100


Exercise 2: Mathematical Preliminaries
Submission Deadline: May 01 2023, 07:00 UTC

University of Oldenburg
Summer 2023
Instructors: Maria Fernanda "MaFe" Davila Restrepo, Wolfram "Wolle" Wingerath

Submitted by: Mehmet Yalcin

Part 1: Probabilities       / 25

1.) Suppose that 80% of people like peanut butter, 89% like jelly, and 78% like both. Given that a randomly
sampled person likes peanut butter, what is the probability that she also likes jelly?

Solution:

In [39]: # probability of liking peanut butter


P_PB = 0.8

# probability of liking jelly


P_J = 0.89

# probability of liking both


P_PB_J = 0.78

# conditional probability: P(J | PB) = P(PB and J) / P(PB)


P_J_PB = P_PB_J / P_PB

print("The probability of liking jelly given that the person likes peanut butter is:",
      P_J_PB)

The probability of liking jelly given that the person likes peanut butter is: 0.975

By the way, you can also add markdown cells where you can use LaTeX, e.g. in equations:

2x + y = 5
x − y = 1

2.) Suppose that P(A) = 0.3 and P(B) = 0.7.


(a) Can you compute P(A and B) if you only know P(A) and P(B)?

Solution:

No. We have no information about the relationship between A and B; in particular, we do
not know whether they are independent. If A and B are independent events, then we can
compute P(A and B) using the formula: P (A ∩ B) = P (A) ⋅ P (B)
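Without an independence assumption, the joint probability is only constrained by bounds; a minimal sketch of those bounds (the Fréchet–Hoeffding inequalities) for the given marginals:

```python
# Without an independence assumption, P(A and B) is only bounded:
# max(0, P(A) + P(B) - 1) <= P(A and B) <= min(P(A), P(B))
P_A = 0.3
P_B = 0.7

lower = max(0.0, P_A + P_B - 1)  # attained when A and B overlap as little as possible
upper = min(P_A, P_B)            # attained when A is contained in B

print(f"P(A and B) can be anywhere in [{lower}, {upper}]")
```

For these marginals the joint probability can be anywhere from 0.0 to 0.3, which is why the question cannot be answered without more information.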

(b) Assuming that events A and B arise from independent random processes:
(1) What is P(A and B)?

If events A and B are independent, then the probability of both events occurring, denoted by
P (A ∩ B) , can be computed using the multiplication rule of probability:

P (A ∩ B) = P (A) ⋅ P (B)

Solution:

In [40]: # Define the probabilities of events A and B


P_A = 0.3
P_B = 0.7

# Compute the probability of both events occurring using the multiplication rule
P_A_and_B = P_A * P_B

# Print the result


print("P(A and B) =", P_A_and_B)

P(A and B) = 0.21

(2) Again assuming that A and B are independent, what is P(A or B)?

Solution: If events A and B are independent, then the probability of either A or B occurring, denoted
by P (A ∪ B), can be computed using the addition rule of probability:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B)

Since A and B are independent, we have P (A ∩ B) = P (A) ⋅ P (B) , so we can substitute this
expression into the equation above to get:

P (A ∪ B) = P (A) + P (B) − P (A) ⋅ P (B)

In [41]: # Assuming A and B are independent


P_A = 0.3
P_B = 0.7

# Compute P(A or B) using the equation P(A or B) = P(A) + P(B) - P(A and B)
P_A_or_B = P_A + P_B - (P_A * P_B)

print("P(A or B) = ", P_A_or_B)

P(A or B) = 0.79

(3) Assuming again the independence of A and B, what is P(A|B)?

Solution:

P (A|B) = P (A ∩ B) / P (B)
In [42]: # Assuming A and B are independent
P_A = 0.3
P_B = 0.7

# Compute P(A|B) using the equation P(A|B) = P(A and B) / P(B)


P_A_given_B = (P_A * P_B) / P_B

print("P(A|B) = ", P_A_given_B)

P(A|B) = 0.3

3.) Consider a game where your score is the maximum value from two dice throws. Write a small python
function that outputs the probability of each event from {1, ..., 6}.

Solution:

In [43]: # probability of each possible score in the game where the score is
# the maximum value from two dice throws

def dice_probability():
    prob = [0] * 6

    # enumerate all 36 equally likely outcomes and tally each maximum
    for i in range(1, 7):
        for j in range(1, 7):
            prob[max(i, j) - 1] += 1

    return [count / 36 for count in prob]

print(dice_probability())

[0.027777777777777776, 0.08333333333333333, 0.1388888888888889, 0.19444444444444445, 0.25, 0.3055555555555556]

4.) If two binary random variables X and Y are independent, is the complement of X and Y also
independent? Give a proof or a counterexample.

Solution: Yes. If X and Y are independent, then their complements are independent as well:

P (Xᶜ ∩ Y ᶜ) = 1 − P (X ∪ Y ) = 1 − P (X) − P (Y ) + P (X) ⋅ P (Y ) = (1 − P (X)) ⋅ (1 − P (Y )) = P (Xᶜ) ⋅ P (Y ᶜ)

The code below checks this numerically for the uniform case P(X = 1) = P(Y = 1) = 0.5.

In [44]: import numpy as np

# Define the probability mass function of X and Y


px = np.array([0.5, 0.5])
py = np.array([0.5, 0.5])

# Compute the joint probability mass function of X and Y


pxy = np.outer(px, py)

# Compute the probability mass function of Z


pz = np.array([pxy[0, 0], pxy[0, 1], pxy[1, 0], pxy[1, 1]])

# Compute the product of probabilities of Z


pzz = pz[0]*pz[3]
pzo = pz[1]*pz[2]
print('P(Z=(0,0))P(Z=(1,1)) =', pzz)
print('P(Z=(0,1))P(Z=(1,0)) =', pzo)
print('P(Z) =', pz)

P(Z=(0,0))P(Z=(1,1)) = 0.0625
P(Z=(0,1))P(Z=(1,0)) = 0.0625
P(Z) = [0.25 0.25 0.25 0.25]
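The identity behind the proof, P(Xᶜ ∩ Yᶜ) = (1 − P(X)) ⋅ (1 − P(Y)), holds for any marginals, not just the uniform case tested above; a quick numerical check over a few arbitrarily chosen marginal probabilities:

```python
import itertools

# Verify P(not X and not Y) == P(not X) * P(not Y) under independence,
# for several arbitrarily chosen marginals p = P(X=1), q = P(Y=1).
for p, q in itertools.product([0.1, 0.5, 0.8], repeat=2):
    # joint probability that both are 0, under independence
    p_both_zero = (1 - p) * (1 - q)
    # the same quantity via inclusion-exclusion: 1 - P(X=1 or Y=1)
    via_inclusion_exclusion = 1 - (p + q - p * q)
    assert abs(p_both_zero - via_inclusion_exclusion) < 1e-12

print("complements are independent for all tested marginals")
```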

Part 2: Statistics       / 25

1.) Consider the following pair of distributions:

In [45]: dist1 = [3, 5, 5, 5, 8, 11, 11, 11, 13]


dist2 = [3, 5, 5, 5, 8, 11, 11, 11, 20]

(a) Decide which one has the greater mean and the greater standard deviation without computing either
of those.

Solution: The two distributions are identical except for the largest value (13 vs. 20), so
dist2 must have the greater mean. The value 20 also lies farther from the mean than 13
does, so dist2 has the greater standard deviation as well. Verifying by computation:

In [46]: import numpy as np

dist1 = [3, 5, 5, 5, 8, 11, 11, 11, 13]


dist2 = [3, 5, 5, 5, 8, 11, 11, 11, 20]

# Compute the means


mean1 = np.mean(dist1)
mean2 = np.mean(dist2)

# Compute the standard deviations


std1 = np.std(dist1, ddof=1) # We use ddof=1 for unbiased estimation
std2 = np.std(dist2, ddof=1)

print(f"Mean of dist1: {mean1:.2f}")


print(f"Mean of dist2: {mean2:.2f}")
print(f"Standard deviation of dist1: {std1:.2f}")
print(f"Standard deviation of dist2: {std2:.2f}")

Mean of dist1: 8.00


Mean of dist2: 8.78
Standard deviation of dist1: 3.61
Standard deviation of dist2: 5.21

(b) Compute mean and standard deviation of these distributions using built-in functions of the pandas
library.

Solution:

In [47]: import pandas as pd

dist1 = [3, 5, 5, 5, 8, 11, 11, 11, 13]


dist2 = [3, 5, 5, 5, 8, 11, 11, 11, 20]

# create pandas Series from lists


dist1_series = pd.Series(dist1)
dist2_series = pd.Series(dist2)
# compute mean and standard deviation
mean1 = dist1_series.mean()
mean2 = dist2_series.mean()
std1 = dist1_series.std()
std2 = dist2_series.std()

# print results
print("Mean of dist1:", mean1)
print("Mean of dist2:", mean2)
print("Standard deviation of dist1:", std1)
print("Standard deviation of dist2:", std2)

Mean of dist1: 8.0


Mean of dist2: 8.777777777777779
Standard deviation of dist1: 3.605551275463989
Standard deviation of dist2: 5.214829282387338

(c) Now define a function yourself to compute mean and standard deviation. Verify that the functions work
correctly by comparing the output with the output of the built-in functions.

Solution:

In [48]: # Custom implementation of mean and (population) standard deviation
def mean_and_std(data):
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)  # divides by n (ddof=0)
    return mean, variance ** 0.5

# Using the function
mean1, std1 = mean_and_std(dist1)
mean2, std2 = mean_and_std(dist2)
print(f"Mean of dist1: {mean1:.2f}")
print(f"Mean of dist2: {mean2:.2f}")
print(f"Standard deviation of dist1: {std1:.2f}")
print(f"Standard deviation of dist2: {std2:.2f}")

# Using pandas built-in functions


import pandas as pd
dist1_series = pd.Series(dist1)
dist2_series = pd.Series(dist2)
print(f"Mean of dist1: {dist1_series.mean():.2f}")
print(f"Mean of dist2: {dist2_series.mean():.2f}")
print(f"Standard deviation of dist1: {dist1_series.std():.2f}")
print(f"Standard deviation of dist2: {dist2_series.std():.2f}")

Mean of dist1: 8.00


Mean of dist2: 8.78
Standard deviation of dist1: 3.40
Standard deviation of dist2: 4.92
Mean of dist1: 8.00
Mean of dist2: 8.78
Standard deviation of dist1: 3.61
Standard deviation of dist2: 5.21
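The two sets of standard deviations above differ (3.40/4.92 vs. 3.61/5.21) because pandas' `Series.std` defaults to the sample standard deviation (dividing by n − 1), while dividing by n gives the population version. A minimal sketch of the difference using NumPy's `ddof` parameter:

```python
import numpy as np

dist1 = [3, 5, 5, 5, 8, 11, 11, 11, 13]

# population std: squared deviations divided by n
pop_std = np.std(dist1, ddof=0)
# sample std: squared deviations divided by n - 1 (pandas' Series.std default)
samp_std = np.std(dist1, ddof=1)

print(f"ddof=0 (population): {pop_std:.2f}")  # 3.40
print(f"ddof=1 (sample):     {samp_std:.2f}")  # 3.61
```

Neither is wrong, but a custom function and a library default have to agree on `ddof` before their outputs can be compared.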

2.) Consider the following distribution:


Distribution 1: [1, 1, 1, 1, 1, 1, 1, 1, 1]
How do the arithmetic and geometric mean compare?

Solution:

In [49]: import numpy as np

dist1 = [1, 1, 1, 1, 1, 1, 1, 1, 1]

arithmetic_mean = np.mean(dist1)
geometric_mean = np.power(np.prod(dist1), 1/len(dist1))

print("Arithmetic mean:", arithmetic_mean)


print("Geometric mean:", geometric_mean)

Arithmetic mean: 1.0


Geometric mean: 1.0

3.) How do the arithmetic and geometric mean compare on random integers that are not identical? Draw a
sample distribution of size 10 a few times and write down your findings.

Note:
Please implement your own version of the geometric mean and use it in this task.

Solution: The arithmetic and geometric means vary with the sample, but by the AM–GM
inequality the geometric mean is never greater than the arithmetic mean, with equality
only when all values are identical. The gap widens for samples containing small values,
because the geometric mean is pulled down disproportionately by values close to zero,
while the arithmetic mean weights all values equally.

In [50]: import random


import math

def arithmetic_mean(distribution):
    return sum(distribution) / len(distribution)

def geometric_mean(distribution):
    return math.prod(distribution) ** (1 / len(distribution))

# Generate a sample distribution of size 10 a few times


for i in range(3):
    sample = [random.randint(1, 10) for _ in range(10)]
    print("Sample distribution:", sample)
    print("Arithmetic mean:", arithmetic_mean(sample))
    print("Geometric mean:", geometric_mean(sample))
    print()

Sample distribution: [2, 7, 6, 1, 9, 1, 9, 6, 6, 2]


Arithmetic mean: 4.9
Geometric mean: 3.7068898410313595

Sample distribution: [6, 7, 1, 7, 8, 9, 7, 3, 6, 7]


Arithmetic mean: 6.1
Geometric mean: 5.334759447512156

Sample distribution: [6, 4, 3, 2, 2, 5, 10, 2, 7, 9]


Arithmetic mean: 5.0
Geometric mean: 4.225453275732146
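The outputs above are consistent with the AM–GM inequality: for positive numbers the geometric mean never exceeds the arithmetic mean. A quick check over many random samples (the seed is an arbitrary choice for reproducibility):

```python
import math
import random

random.seed(42)  # arbitrary seed for reproducibility

# AM-GM: for positive numbers the geometric mean never exceeds the
# arithmetic mean, with equality only when all values are equal.
for _ in range(100):
    sample = [random.randint(1, 10) for _ in range(10)]
    am = sum(sample) / len(sample)
    gm = math.prod(sample) ** (1 / len(sample))
    assert gm <= am + 1e-12  # tolerance for floating-point rounding

print("AM >= GM held for all 100 random samples")
```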

Part 3: Correlation Analysis       / 25

1.) A correlation coefficient of −0.9 indicates a stronger linear relationship than a correlation coefficient of
0.5 – true or false? Explain why.

Solution: True. The correlation coefficient is a measure of the strength and direction of the linear
relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfectly negative
linear relationship, 0 indicates no linear relationship, and 1 indicates a perfectly positive linear
relationship. In the case of a correlation coefficient of -0.9, it indicates a strong negative linear
relationship between the two variables. This means that as one variable increases, the other variable
tends to decrease in a linear fashion. On the other hand, a correlation coefficient of 0.5 indicates a
moderate positive linear relationship, where as one variable increases, the other variable tends to
increase as well, but to a lesser extent than in the case of a correlation coefficient of 1. Therefore, the
statement is true as a correlation coefficient of -0.9 indicates a stronger linear relationship than a
correlation coefficient of 0.5.

In [51]: import numpy as np

# Multiplying independent noise by a scalar rescales it but does not create any
# correlation with the first sample, so we instead construct data with a chosen
# target correlation r using y = r*x + sqrt(1 - r^2)*noise:


x = np.random.normal(0, 1, 1000)
noise = np.random.normal(0, 1, 1000)

y_neg = -0.9 * x + np.sqrt(1 - 0.9 ** 2) * noise
y_pos = 0.5 * x + np.sqrt(1 - 0.5 ** 2) * noise

# Calculate and print the sample correlation coefficients


print("Sample correlation (target -0.9):", np.corrcoef(x, y_neg)[0, 1])
print("Sample correlation (target 0.5):", np.corrcoef(x, y_pos)[0, 1])

(The printed sample correlations vary from run to run, but stay close to the targets of
-0.9 and 0.5.)

The absolute value of the correlation coefficient measures the strength of the linear
relationship: the closer |r| is to 1, the stronger the relationship. A coefficient of
-0.9 therefore indicates a strong negative linear relationship, while 0.5 indicates only
a moderate positive one.

2.) Compute the Pearson and Spearman Rank correlations for uniformly drawn samples of points (x, x^k)
and answer the questions below.

Note:
You can use scipy.stats.spearmanr and scipy.stats.pearsonr to compute the ranks.

Solution: The code below draws a sample of n x-values and a single random exponent k,
computes y = x^k, and then reports both the Pearson and the Spearman rank correlation
between x and y. Running it multiple times shows how the correlations change for
different random samples of x and k.

In [52]: import numpy as np


from scipy.stats import spearmanr, pearsonr

# Sample size
n = 100

# Randomly generate x and k


x = np.random.rand(n)
k = np.random.randint(1, 6)

# Calculate y = x^k
y = np.power(x, k)

# Calculate Pearson correlation


pearson_corr, _ = pearsonr(x, y)

# Calculate Spearman rank correlation


spearman_corr, _ = spearmanr(x, y)

# Print results
print(f"Pearson correlation: {pearson_corr}")
print(f"Spearman rank correlation: {spearman_corr}")

Pearson correlation: 0.8667638593959224


Spearman rank correlation: 0.9999999999999999

(a) How do these values change as a function of increasing k?

Solution: For x drawn from (0, 1), y = x^k is a strictly increasing function of x for
every k ≥ 1, so the Spearman rank correlation stays exactly 1 regardless of k: the ranks
of x and y always agree. The Pearson correlation, by contrast, decreases as k increases,
because the relationship becomes more strongly curved and is therefore less well
approximated by a straight line.
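One way to probe (a) empirically is to fix a sample of positive x-values and sweep k; a minimal sketch (the sample size, seed, and lower bound of 0.01 are arbitrary choices):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0.01, 1, 1000)  # positive x only

# For monotonic y = x^k the ranks of x and y agree, so Spearman stays at 1,
# while Pearson drops as the curve bends further from a straight line.
for k in [1, 2, 5, 10]:
    y = x ** k
    p, _ = pearsonr(x, y)
    s, _ = spearmanr(x, y)
    print(f"k={k:2d}: Pearson={p:.3f}, Spearman={s:.3f}")
```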

(b) Does it matter whether x can be positive and negative or positive only?

Solution: Yes, it matters. If x is positive only, x ↦ x^k is monotonic for every k ≥ 1,
so the Spearman correlation is exactly 1 and the Pearson correlation is high. If x can
also be negative, the answer depends on the parity of k: for odd k, x^k is still
monotonic and the picture is unchanged, but for even k, x^k is not monotonic on an
interval containing zero (e.g. x^2 decreases for x < 0 and increases for x > 0), so both
the Pearson and the Spearman correlation drop toward zero.
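The parity effect can be checked directly: with an even exponent, y = x^k is monotonic on (0, 1) but not on (−1, 1). A minimal sketch (seed and sample size chosen arbitrarily):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
x_pos = rng.uniform(0, 1, 1000)   # positive x only
x_sym = rng.uniform(-1, 1, 1000)  # x can be negative
k = 2                             # even exponent

s_pos, _ = spearmanr(x_pos, x_pos ** k)  # monotonic on (0, 1): Spearman = 1
s_sym, _ = spearmanr(x_sym, x_sym ** k)  # non-monotonic on (-1, 1): near 0

print(f"positive x:  Spearman = {s_pos:.3f}")
print(f"symmetric x: Spearman = {s_sym:.3f}")
```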

Part 4: Logarithms       / 25

Recall the definition of the geometric mean:

(∏ᵢ₌₁ⁿ aᵢ)^(1/n) = ⁿ√(a₁ ⋅ a₂ ⋯ aₙ)
For large sample sets or small sample sets with very large numbers, the computed product will become
very large. Chances are that your implementation of the geometric mean does not work with large sample
sizes, either – go ahead and try it with a sample size of, let's say, 1000!

Assuming that your implementation does not work for a sample size of 1000, create one that does!

Solution:

One way to handle large sample sizes when computing the geometric mean is to use the logarithmic
form of the formula. The logarithmic form of the geometric mean is:

GM = exp( (1/n) * sum(log(xi)) )

Using this formula, we can compute the geometric mean for large sample sizes by
computing the sum of the logarithms of the data points and then taking the
exponential of the sum divided by the sample size.

Below is an implementation with a sample size of 1000:

In [53]: import math


import random

# Generate a list of 1000 random data points


data = [random.uniform(0, 1) for i in range(1000)]

# Calculate the sum of the logarithms of all the data points


log_sum = sum(math.log(x) for x in data)

# Calculate the geometric mean using the logarithmic sum


geometric_mean = math.exp(log_sum/1000)

# Print the result


print("Geometric mean:", geometric_mean)

Geometric mean: 0.37394926253252064

In this implementation, we first generate a list of 1000 random data points using the random module. We
then calculate the sum of the logarithms of all the data points using the built-in math.log() function. This
avoids the issue of very large numbers that can arise when calculating the product of all the data points.

We then calculate the geometric mean using the logarithmic sum and the number of data points, and
print the result using the built-in math.exp() function to exponentiate the logarithmic sum.
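The failure mode of the naive product can also be shown directly: the product of 1000 values from (0, 1) underflows to 0.0 in double precision, while the log-sum form stays well-scaled. A small sketch (the 0.01 lower bound is an arbitrary choice that keeps log() safe):

```python
import math
import random

random.seed(0)  # arbitrary seed for reproducibility
data = [random.uniform(0.01, 1) for _ in range(1000)]

# The naive product of 1000 values from (0, 1) is around e^-950, far below
# the smallest positive double (~1e-308), so it underflows to 0.0.
naive_product = math.prod(data)

# The log-sum form works with numbers of ordinary magnitude throughout.
log_sum = sum(math.log(x) for x in data)
gm = math.exp(log_sum / len(data))

print("naive product:", naive_product)           # 0.0 (underflow)
print("geometric mean via logs:", gm)
```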

Finally: Submission
Save your notebook and submit it (as both notebook and PDF file). And please don't forget to ...

... choose a file name according to convention (see Exercise Sheet 1) and to
... include the execution output in your submission!
