Exercise 2 Submission: Group 12, Mehmet Yalcin


Data Science I

Overall Points:       / 100


Exercise 2: Mathematical Preliminaries
Submission Deadline: May 01 2023, 07:00 UTC

University of Oldenburg
Summer 2023
Instructors: Maria Fernanda "MaFe" Davila Restrepo, Wolfram "Wolle" Wingerath

Submitted by: Mehmet Yalcin

Part 1: Probabilities       / 25

1.) Suppose that 80% of people like peanut butter, 89% like jelly, and 78% like both. Given that a randomly
sampled person likes peanut butter, what is the probability that she also likes jelly?

Solution:

In [39]: # probability of liking peanut butter


P_PB = 0.8

# probability of liking jelly


P_J = 0.89

# probability of liking both


P_PB_J = 0.78

# conditional probability: P(J | PB) = P(PB and J) / P(PB)


P_J_PB = P_PB_J / P_PB

print("The probability of liking jelly given that the person likes peanut butter is:",
      P_J_PB)

The probability of liking jelly given that the person likes peanut butter is: 0.975

By the way, you can also add markdown cells where you can use LaTeX, e.g. in equations:

2x + y = 5
x − y = 1

2.) Suppose that P(A) = 0.3 and P(B) = 0.7.


(a) Can you compute P(A and B) if you only know P(A) and P(B)?

Solution:

No. We have no information about the relationship between A and B; in particular, we do
not know whether they are independent. If A and B are independent events, then we can
compute P(A and B) using the formula: P (A ∩ B) = P (A) ⋅ P (B)
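Without an independence assumption, the joint probability is only constrained by bounds; a minimal sketch of those bounds (the Fréchet–Hoeffding inequalities) for the given marginals:

```python
# Without an independence assumption, P(A and B) is only bounded:
# max(0, P(A) + P(B) - 1) <= P(A and B) <= min(P(A), P(B))
P_A = 0.3
P_B = 0.7

lower = max(0.0, P_A + P_B - 1)  # attained when A and B overlap as little as possible
upper = min(P_A, P_B)            # attained when A is contained in B

print(f"P(A and B) can be anywhere in [{lower}, {upper}]")
```

For these marginals the joint probability can be anywhere from 0.0 to 0.3, which is why the question cannot be answered without more information.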

(b) Assuming that events A and B arise from independent random processes:
(1) What is P(A and B)?

If events A and B are independent, then the probability of both events occurring, denoted by
P (A ∩ B) , can be computed using the multiplication rule of probability:

P (A ∩ B) = P (A) ⋅ P (B)

Solution:

In [40]: # Define the probabilities of events A and B


P_A = 0.3
P_B = 0.7

# Compute the probability of both events occurring using the multiplication rule
P_A_and_B = P_A * P_B

# Print the result


print("P(A and B) =", P_A_and_B)

P(A and B) = 0.21

(2) Again assuming that A and B are independent, what is P(A or B)?

Solution: If events A and B are independent, then the probability of either A or B occurring, denoted
by P (A ∪ B), can be computed using the addition rule of probability:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B)

Since A and B are independent, we have P (A ∩ B) = P (A) ⋅ P (B) , so we can substitute this
expression into the equation above to get:

P (A ∪ B) = P (A) + P (B) − P (A) ⋅ P (B)

In [41]: # Assuming A and B are independent


P_A = 0.3
P_B = 0.7

# Compute P(A or B) using the equation P(A or B) = P(A) + P(B) - P(A and B)
P_A_or_B = P_A + P_B - (P_A * P_B)

print("P(A or B) = ", P_A_or_B)

P(A or B) = 0.79

(3) Assuming again the independence of A and B, what is P(A|B)?

Solution:

P (A|B) = P (A ∩ B) / P (B)
In [42]: # Assuming A and B are independent
P_A = 0.3
P_B = 0.7

# Compute P(A|B) using the equation P(A|B) = P(A and B) / P(B)


P_A_given_B = (P_A * P_B) / P_B

print("P(A|B) = ", P_A_given_B)

P(A|B) = 0.3

3.) Consider a game where your score is the maximum value from two dice throws. Write a small python
function that outputs the probability of each event from {1, ..., 6}.

Solution:

In [43]: # probability of each possible score in the game where the score is
# the maximum value from two dice throws

def dice_probability():
    prob = [0] * 6

    # enumerate all 36 equally likely outcomes and tally each maximum
    for i in range(1, 7):
        for j in range(1, 7):
            prob[max(i, j) - 1] += 1

    return [count / 36 for count in prob]

print(dice_probability())

[0.027777777777777776, 0.08333333333333333, 0.1388888888888889, 0.19444444444444445, 0.25, 0.3055555555555556]

4.) If two binary random variables X and Y are independent, is the complement of X and Y also
independent? Give a proof or a counterexample.

Solution: Yes. If X and Y are independent, then their complements are independent as well:

P (Xᶜ ∩ Y ᶜ) = 1 − P (X ∪ Y ) = 1 − P (X) − P (Y ) + P (X) ⋅ P (Y ) = (1 − P (X)) ⋅ (1 − P (Y )) = P (Xᶜ) ⋅ P (Y ᶜ)

The code below checks this numerically for the uniform case P(X = 1) = P(Y = 1) = 0.5.

In [44]: import numpy as np

# Define the probability mass function of X and Y


px = np.array([0.5, 0.5])
py = np.array([0.5, 0.5])

# Compute the joint probability mass function of X and Y


pxy = np.outer(px, py)

# Compute the probability mass function of Z


pz = np.array([pxy[0, 0], pxy[0, 1], pxy[1, 0], pxy[1, 1]])

# Compute the product of probabilities of Z


pzz = pz[0]*pz[3]
pzo = pz[1]*pz[2]
print('P(Z=(0,0))P(Z=(1,1)) =', pzz)
print('P(Z=(0,1))P(Z=(1,0)) =', pzo)
print('P(Z) =', pz)

P(Z=(0,0))P(Z=(1,1)) = 0.0625
P(Z=(0,1))P(Z=(1,0)) = 0.0625
P(Z) = [0.25 0.25 0.25 0.25]
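The identity behind the proof, P(Xᶜ ∩ Yᶜ) = (1 − P(X)) ⋅ (1 − P(Y)), holds for any marginals, not just the uniform case tested above; a quick numerical check over a few arbitrarily chosen marginal probabilities:

```python
import itertools

# Verify P(not X and not Y) == P(not X) * P(not Y) under independence,
# for several arbitrarily chosen marginals p = P(X=1), q = P(Y=1).
for p, q in itertools.product([0.1, 0.5, 0.8], repeat=2):
    # joint probability that both are 0, under independence
    p_both_zero = (1 - p) * (1 - q)
    # the same quantity via inclusion-exclusion: 1 - P(X=1 or Y=1)
    via_inclusion_exclusion = 1 - (p + q - p * q)
    assert abs(p_both_zero - via_inclusion_exclusion) < 1e-12

print("complements are independent for all tested marginals")
```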

Part 2: Statistics       / 25

1.) Consider the following pair of distributions:

In [45]: dist1 = [3, 5, 5, 5, 8, 11, 11, 11, 13]


dist2 = [3, 5, 5, 5, 8, 11, 11, 11, 20]

(a) Decide which one has the greater mean and the greater standard deviation without computing either
of those.

Solution: The two distributions are identical except for the largest value (13 vs. 20), so
dist2 must have the greater mean. The value 20 also lies farther from the mean than 13
does, so dist2 has the greater standard deviation as well. Verifying by computation:

In [46]: import numpy as np

dist1 = [3, 5, 5, 5, 8, 11, 11, 11, 13]


dist2 = [3, 5, 5, 5, 8, 11, 11, 11, 20]

# Compute the means


mean1 = np.mean(dist1)
mean2 = np.mean(dist2)

# Compute the standard deviations


std1 = np.std(dist1, ddof=1) # We use ddof=1 for unbiased estimation
std2 = np.std(dist2, ddof=1)

print(f"Mean of dist1: {mean1:.2f}")


print(f"Mean of dist2: {mean2:.2f}")
print(f"Standard deviation of dist1: {std1:.2f}")
print(f"Standard deviation of dist2: {std2:.2f}")

Mean of dist1: 8.00


Mean of dist2: 8.78
Standard deviation of dist1: 3.61
Standard deviation of dist2: 5.21

(b) Compute mean and standard deviation of these distributions using built-in functions of the pandas
library.

Solution:

In [47]: import pandas as pd

dist1 = [3, 5, 5, 5, 8, 11, 11, 11, 13]


dist2 = [3, 5, 5, 5, 8, 11, 11, 11, 20]

# create pandas Series from lists


dist1_series = pd.Series(dist1)
dist2_series = pd.Series(dist2)
# compute mean and standard deviation
mean1 = dist1_series.mean()
mean2 = dist2_series.mean()
std1 = dist1_series.std()
std2 = dist2_series.std()

# print results
print("Mean of dist1:", mean1)
print("Mean of dist2:", mean2)
print("Standard deviation of dist1:", std1)
print("Standard deviation of dist2:", std2)

Mean of dist1: 8.0


Mean of dist2: 8.777777777777779
Standard deviation of dist1: 3.605551275463989
Standard deviation of dist2: 5.214829282387338

(c) Now define a function yourself to compute mean and standard deviation. Verify that the functions work
correctly by comparing the output with the output of the built-in functions.

Solution:

In [48]: # Custom implementation of mean and (population) standard deviation
def mean_and_std(data):
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)  # divides by n (ddof=0)
    return mean, variance ** 0.5

# Using the function
mean1, std1 = mean_and_std(dist1)
mean2, std2 = mean_and_std(dist2)
print(f"Mean of dist1: {mean1:.2f}")
print(f"Mean of dist2: {mean2:.2f}")
print(f"Standard deviation of dist1: {std1:.2f}")
print(f"Standard deviation of dist2: {std2:.2f}")

# Using pandas built-in functions


import pandas as pd
dist1_series = pd.Series(dist1)
dist2_series = pd.Series(dist2)
print(f"Mean of dist1: {dist1_series.mean():.2f}")
print(f"Mean of dist2: {dist2_series.mean():.2f}")
print(f"Standard deviation of dist1: {dist1_series.std():.2f}")
print(f"Standard deviation of dist2: {dist2_series.std():.2f}")

Mean of dist1: 8.00


Mean of dist2: 8.78
Standard deviation of dist1: 3.40
Standard deviation of dist2: 4.92
Mean of dist1: 8.00
Mean of dist2: 8.78
Standard deviation of dist1: 3.61
Standard deviation of dist2: 5.21
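The two sets of standard deviations above differ (3.40/4.92 vs. 3.61/5.21) because pandas' `Series.std` defaults to the sample standard deviation (dividing by n − 1), while dividing by n gives the population version. A minimal sketch of the difference using NumPy's `ddof` parameter:

```python
import numpy as np

dist1 = [3, 5, 5, 5, 8, 11, 11, 11, 13]

# population std: squared deviations divided by n
pop_std = np.std(dist1, ddof=0)
# sample std: squared deviations divided by n - 1 (pandas' Series.std default)
samp_std = np.std(dist1, ddof=1)

print(f"ddof=0 (population): {pop_std:.2f}")  # 3.40
print(f"ddof=1 (sample):     {samp_std:.2f}")  # 3.61
```

Neither is wrong, but a custom function and a library default have to agree on `ddof` before their outputs can be compared.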

2.) Consider the following distribution:


Distribution 1: [1, 1, 1, 1, 1, 1, 1, 1, 1]
How do the arithmetic and geometric mean compare?

Solution:

In [49]: import numpy as np

dist1 = [1, 1, 1, 1, 1, 1, 1, 1, 1]

arithmetic_mean = np.mean(dist1)
geometric_mean = np.power(np.prod(dist1), 1/len(dist1))

print("Arithmetic mean:", arithmetic_mean)


print("Geometric mean:", geometric_mean)

Arithmetic mean: 1.0


Geometric mean: 1.0

3.) How do the arithmetic and geometric mean compare on random integers that are not identical? Draw a
sample distribution of size 10 a few times and write down your findings.

Note:
Please implement your own version of the geometric mean and use it in this task.

Solution: The arithmetic and geometric means vary with the sample, but by the AM–GM
inequality the geometric mean is never greater than the arithmetic mean, with equality
only when all values are identical. The gap widens for samples containing small values,
because the geometric mean is pulled down disproportionately by values close to zero,
while the arithmetic mean weights all values equally.

In [50]: import random


import math

def arithmetic_mean(distribution):
    return sum(distribution) / len(distribution)

def geometric_mean(distribution):
    return math.prod(distribution) ** (1 / len(distribution))

# Generate a sample distribution of size 10 a few times


for i in range(3):
    sample = [random.randint(1, 10) for _ in range(10)]
    print("Sample distribution:", sample)
    print("Arithmetic mean:", arithmetic_mean(sample))
    print("Geometric mean:", geometric_mean(sample))
    print()

Sample distribution: [2, 7, 6, 1, 9, 1, 9, 6, 6, 2]


Arithmetic mean: 4.9
Geometric mean: 3.7068898410313595

Sample distribution: [6, 7, 1, 7, 8, 9, 7, 3, 6, 7]


Arithmetic mean: 6.1
Geometric mean: 5.334759447512156

Sample distribution: [6, 4, 3, 2, 2, 5, 10, 2, 7, 9]


Arithmetic mean: 5.0
Geometric mean: 4.225453275732146
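The outputs above are consistent with the AM–GM inequality: for positive numbers the geometric mean never exceeds the arithmetic mean. A quick check over many random samples (the seed is an arbitrary choice for reproducibility):

```python
import math
import random

random.seed(42)  # arbitrary seed for reproducibility

# AM-GM: for positive numbers the geometric mean never exceeds the
# arithmetic mean, with equality only when all values are equal.
for _ in range(100):
    sample = [random.randint(1, 10) for _ in range(10)]
    am = sum(sample) / len(sample)
    gm = math.prod(sample) ** (1 / len(sample))
    assert gm <= am + 1e-12  # tolerance for floating-point rounding

print("AM >= GM held for all 100 random samples")
```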

Part 3: Correlation Analysis       / 25

1.) A correlation coefficient of −0.9 indicates a stronger linear relationship than a correlation coefficient of
0.5 – true or false? Explain why.

Solution: True. The correlation coefficient is a measure of the strength and direction of the linear
relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfectly negative
linear relationship, 0 indicates no linear relationship, and 1 indicates a perfectly positive linear
relationship. In the case of a correlation coefficient of -0.9, it indicates a strong negative linear
relationship between the two variables. This means that as one variable increases, the other variable
tends to decrease in a linear fashion. On the other hand, a correlation coefficient of 0.5 indicates a
moderate positive linear relationship, where as one variable increases, the other variable tends to
increase as well, but to a lesser extent than in the case of a correlation coefficient of 1. Therefore, the
statement is true as a correlation coefficient of -0.9 indicates a stronger linear relationship than a
correlation coefficient of 0.5.

In [51]: import numpy as np

# Multiplying independent noise by a scalar rescales it but does not create any
# correlation with the first sample, so we instead construct data with a chosen
# target correlation r using y = r*x + sqrt(1 - r^2)*noise:


x = np.random.normal(0, 1, 1000)
noise = np.random.normal(0, 1, 1000)

y_neg = -0.9 * x + np.sqrt(1 - 0.9 ** 2) * noise
y_pos = 0.5 * x + np.sqrt(1 - 0.5 ** 2) * noise

# Calculate and print the sample correlation coefficients


print("Sample correlation (target -0.9):", np.corrcoef(x, y_neg)[0, 1])
print("Sample correlation (target 0.5):", np.corrcoef(x, y_pos)[0, 1])

(The printed sample correlations vary from run to run, but stay close to the targets of
-0.9 and 0.5.)

The absolute value of the correlation coefficient measures the strength of the linear
relationship: the closer |r| is to 1, the stronger the relationship. A coefficient of
-0.9 therefore indicates a strong negative linear relationship, while 0.5 indicates only
a moderate positive one.

2.) Compute the Pearson and Spearman Rank correlations for uniformly drawn samples of points (x, x^k)
and answer the questions below.

Note:
You can use scipy.stats.spearmanr and scipy.stats.pearsonr to compute the ranks.

Solution: The code below draws a sample of n x-values and a single random exponent k,
computes y = x^k, and then reports both the Pearson and the Spearman rank correlation
between x and y. Running it multiple times shows how the correlations change for
different random samples of x and k.

In [52]: import numpy as np


from scipy.stats import spearmanr, pearsonr

# Sample size
n = 100

# Randomly generate x and k


x = np.random.rand(n)
k = np.random.randint(1, 6)

# Calculate y = x^k
y = np.power(x, k)

# Calculate Pearson correlation


pearson_corr, _ = pearsonr(x, y)

# Calculate Spearman rank correlation


spearman_corr, _ = spearmanr(x, y)

# Print results
print(f"Pearson correlation: {pearson_corr}")
print(f"Spearman rank correlation: {spearman_corr}")

Pearson correlation: 0.8667638593959224


Spearman rank correlation: 0.9999999999999999

(a) How do these values change as a function of increasing k?

Solution: For x drawn from (0, 1), y = x^k is a strictly increasing function of x for
every k ≥ 1, so the Spearman rank correlation stays exactly 1 regardless of k: the ranks
of x and y always agree. The Pearson correlation, by contrast, decreases as k increases,
because the relationship becomes more strongly curved and is therefore less well
approximated by a straight line.
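One way to probe (a) empirically is to fix a sample of positive x-values and sweep k; a minimal sketch (the sample size, seed, and lower bound of 0.01 are arbitrary choices):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0.01, 1, 1000)  # positive x only

# For monotonic y = x^k the ranks of x and y agree, so Spearman stays at 1,
# while Pearson drops as the curve bends further from a straight line.
for k in [1, 2, 5, 10]:
    y = x ** k
    p, _ = pearsonr(x, y)
    s, _ = spearmanr(x, y)
    print(f"k={k:2d}: Pearson={p:.3f}, Spearman={s:.3f}")
```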

(b) Does it matter whether x can be positive and negative or positive only?

Solution: Yes, it matters. If x is positive only, x ↦ x^k is monotonic for every k ≥ 1,
so the Spearman correlation is exactly 1 and the Pearson correlation is high. If x can
also be negative, the answer depends on the parity of k: for odd k, x^k is still
monotonic and the picture is unchanged, but for even k, x^k is not monotonic on an
interval containing zero (e.g. x^2 decreases for x < 0 and increases for x > 0), so both
the Pearson and the Spearman correlation drop toward zero.
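The parity effect can be checked directly: with an even exponent, y = x^k is monotonic on (0, 1) but not on (−1, 1). A minimal sketch (seed and sample size chosen arbitrarily):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
x_pos = rng.uniform(0, 1, 1000)   # positive x only
x_sym = rng.uniform(-1, 1, 1000)  # x can be negative
k = 2                             # even exponent

s_pos, _ = spearmanr(x_pos, x_pos ** k)  # monotonic on (0, 1): Spearman = 1
s_sym, _ = spearmanr(x_sym, x_sym ** k)  # non-monotonic on (-1, 1): near 0

print(f"positive x:  Spearman = {s_pos:.3f}")
print(f"symmetric x: Spearman = {s_sym:.3f}")
```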

Part 4: Logarithms       / 25

Recall the definition of the geometric mean:

(∏ᵢ₌₁ⁿ aᵢ)^(1/n) = ⁿ√(a₁ ⋅ a₂ ⋯ aₙ)
For large sample sets or small sample sets with very large numbers, the computed product will become
very large. Chances are that your implementation of the geometric mean does not work with large sample
sizes, either – go ahead and try it with a sample size of, let's say, 1000!

Assuming that your implementation does not work for a sample size of 1000, create one that does!

Solution:

One way to handle large sample sizes when computing the geometric mean is to use the logarithmic
form of the formula. The logarithmic form of the geometric mean is:

GM = exp( (1/n) * sum(log(xi)) )

Using this formula, we can compute the geometric mean for large sample sizes by
computing the sum of the logarithms of the data points and then taking the
exponential of the sum divided by the sample size.

Below is an implementation with a sample size of 1000:

In [53]: import math


import random

# Generate a list of 1000 random data points


data = [random.uniform(0, 1) for i in range(1000)]

# Calculate the sum of the logarithms of all the data points


log_sum = sum(math.log(x) for x in data)

# Calculate the geometric mean using the logarithmic sum


geometric_mean = math.exp(log_sum/1000)

# Print the result


print("Geometric mean:", geometric_mean)

Geometric mean: 0.37394926253252064

In this implementation, we first generate a list of 1000 random data points using the random module. We
then calculate the sum of the logarithms of all the data points using the built-in math.log() function. This
avoids the issue of very large numbers that can arise when calculating the product of all the data points.

We then calculate the geometric mean using the logarithmic sum and the number of data points, and
print the result using the built-in math.exp() function to exponentiate the logarithmic sum.
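The failure mode of the naive product can also be shown directly: the product of 1000 values from (0, 1) underflows to 0.0 in double precision, while the log-sum form stays well-scaled. A small sketch (the 0.01 lower bound is an arbitrary choice that keeps log() safe):

```python
import math
import random

random.seed(0)  # arbitrary seed for reproducibility
data = [random.uniform(0.01, 1) for _ in range(1000)]

# The naive product of 1000 values from (0, 1) is around e^-950, far below
# the smallest positive double (~1e-308), so it underflows to 0.0.
naive_product = math.prod(data)

# The log-sum form works with numbers of ordinary magnitude throughout.
log_sum = sum(math.log(x) for x in data)
gm = math.exp(log_sum / len(data))

print("naive product:", naive_product)           # 0.0 (underflow)
print("geometric mean via logs:", gm)
```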

Finally: Submission
Save your notebook and submit it (as both notebook and PDF file). And please don't forget to ...

... choose a file name according to convention (see Exercise Sheet 1) and to
... include the execution output in your submission!
