SMDM Report
A wholesale distributor operating in different regions of Portugal has information on the annual spending on several items
in its stores across different regions and channels. The data consist of 440 large retailers’ annual spending on 6
different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across 2 sales channels (Hotel,
Retail).
Buyer/Spender  Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicatessen
1              Retail   Other   12669  9656  7561     214     2674              1338
2              Retail   Other   7057   9810  9568     1762    3293              1776
3              Retail   Other   6353   8808  7684     2405    3516              7844
4              Hotel    Other   13265  1196  4221     6404    507               1788
5              Retail   Other   22615  5410  7198     3915    1777              5185
The table above shows that the data set covers 6 different types of items. Buyer/Spender (a running identifier from 1 to
440), Channel and Region are categorical data, whereas the remaining six columns are int type data that measure the amount
spent on each type of item.
count unique top freq mean std min 25% 50% 75% max
Buyer/Spender 440 NaN NaN NaN 220.5 127.16 1 110.8 221 330.3 440
Channel 440 2 Hotel 298 NaN NaN NaN NaN NaN NaN NaN
Region 440 3 Other 316 NaN NaN NaN NaN NaN NaN NaN
Fresh 440 NaN NaN NaN 12000 12647 3 3128 8504 16934 112151
Milk 440 NaN NaN NaN 5796 7380 55 1533 3627 7190 73498
Grocery 440 NaN NaN NaN 7951 9503 3 2153 4756 10656 92780
Frozen 440 NaN NaN NaN 3072 4855 25 742 1526 3554 60869
Detergents_Paper 440 NaN NaN NaN 2881 4768 3 257 817 3922 40827
Delicatessen 440 NaN NaN NaN 1525 2820 3 408 966 1820 47943
1) There are 3 unique values in the Region column, of which Other has the most entries (316 of 440). There are 2 unique
values in the Channel column, of which Hotel has more entries (298 of 440).
2) For all the 6 items, the standard deviation is greater than the mean.
3) All the 6 items have a significant number of outliers, which can be seen from the large gap between the maximum and
the 75% quartile value. It can be further inferred that the distribution is right skewed, since for all six items the
mean is greater than the median, which is an indication of right skewness.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Buyer/Spender 440 non-null bool
1 Channel 440 non-null bool
2 Region 440 non-null bool
3 Fresh 440 non-null bool
4 Milk 440 non-null bool
5 Grocery 440 non-null bool
6 Frozen 440 non-null bool
7 Detergents_Paper 440 non-null bool
8 Delicatessen 440 non-null bool
dtypes: bool(9)
memory usage: 4.0 KB
From the above output, together with the count of 440 reported for every column in the summary table, it can be said that
no null values are present in the provided data set.
Channel  Region  Fresh    Milk     Grocery  Frozen  Detergents_Paper  Delicatessen  Total Spend
Hotel    Lisbon  761233   228342   237542   184512  56081             70632         1538342
Hotel    Oporto  326215   64519    123074   160861  13516             30965         719150
Hotel    Other   2928269  735753   820101   771606  165990            320358        5742077
Retail   Lisbon  93600    194112   332495   46514   148055            33695         848471
Retail   Oporto  138506   174625   310200   29271   159795            23541         835938
Retail   Other   1032308  1153006  1675150  158886  724420            191752        4935522
Based on the above table, the Hotel channel has the maximum expenditure and the Retail channel has the minimum expenditure.
Region wise, Oporto has the minimum expenditure and Other has the maximum expenditure. The same can be seen in the figure
below. Also, the combination of Hotel and Other has the maximum expenditure and the combination of Hotel and Oporto has
the minimum expenditure.
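The channel and region totals behind these statements can be recomputed from the Total Spend column of the table. The sketch below is only an illustration: the dictionary is typed in by hand from the table, and the helper name totals_by is hypothetical, not taken from the notebook.

```python
# Total Spend per (channel, region), copied by hand from the table above.
total_spend = {
    ("Hotel", "Lisbon"): 1538342,
    ("Hotel", "Oporto"): 719150,
    ("Hotel", "Other"): 5742077,
    ("Retail", "Lisbon"): 848471,
    ("Retail", "Oporto"): 835938,
    ("Retail", "Other"): 4935522,
}


def totals_by(position):
    """Sum Total Spend over one part of the key: 0 = channel, 1 = region."""
    out = {}
    for key, spend in total_spend.items():
        out[key[position]] = out.get(key[position], 0) + spend
    return out


by_channel = totals_by(0)
by_region = totals_by(1)

# Combination with the highest and the lowest Total Spend.
top_combo = max(total_spend, key=total_spend.get)
bottom_combo = min(total_spend, key=total_spend.get)

print(by_channel)
print(by_region)
print(top_combo, bottom_combo)
```

Summing the column this way gives 7999569 for Hotel against 6619931 for Retail, and 1555088 for Oporto against 10677599 for Other.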
Q.1.2 There are 6 different varieties of items considered. Do all varieties show similar behaviour across Region
and Channel?
From the above table it can be clearly inferred that the expenditure on ‘Fresh’ items is visibly higher in the Hotel channel
than in the Retail channel. Also, in the Hotel channel ‘Fresh’ is the item with the maximum expenditure in every region,
which is not the case in the Retail channel. ‘Grocery’ items are the major contributors to expenditure across the Retail
channel in every region. ‘Frozen’ items and ‘Delicatessen’ items are the minimum contributors to expenditure in the Retail
channel. In both channels the Other region accounts for the majority of the expenditure.
It can be seen from all the above figures that the behavior of the 6 items does not follow a similar trend in both
channels.
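The ranking of items within each channel and region can be checked directly from the table of totals. The following is a minimal sketch with the values typed in by hand; the helper ranked_items is a hypothetical name used for illustration only.

```python
ITEMS = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# Spend per item for each (channel, region), copied by hand from the table of totals.
spend = {
    ("Hotel", "Lisbon"): [761233, 228342, 237542, 184512, 56081, 70632],
    ("Hotel", "Oporto"): [326215, 64519, 123074, 160861, 13516, 30965],
    ("Hotel", "Other"): [2928269, 735753, 820101, 771606, 165990, 320358],
    ("Retail", "Lisbon"): [93600, 194112, 332495, 46514, 148055, 33695],
    ("Retail", "Oporto"): [138506, 174625, 310200, 29271, 159795, 23541],
    ("Retail", "Other"): [1032308, 1153006, 1675150, 158886, 724420, 191752],
}


def ranked_items(values):
    """Return the item names ordered from highest to lowest spend."""
    return [item for _, item in sorted(zip(values, ITEMS), reverse=True)]


ranking = {group: ranked_items(values) for group, values in spend.items()}

for (channel, region), order in ranking.items():
    print(channel, region, "top:", order[0], "bottom two:", order[-2:])
```

In every Hotel row the top item is Fresh, while in every Retail row the top item is Grocery and the bottom two are Frozen and Delicatessen, which is the difference in behavior described above.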
Q.1.3 On the basis of descriptive measure of variability, which item shows the most inconsistent behavior? Which items
show the least inconsistent behavior?
It can be seen from the above figure that a significant number of outliers are present in the data. The same has been
discussed in the first question, using the description of the data as the basis.
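One descriptive measure of variability that is comparable across items with very different means is the coefficient of variation (standard deviation divided by mean). Choosing this measure is an assumption made here for illustration; the report does not state which measure the figure is based on. The sketch uses the rounded mean and std values from the summary table.

```python
# (mean, std) per item, rounded values copied from the summary table.
summary = {
    "Fresh": (12000, 12647),
    "Milk": (5796, 7380),
    "Grocery": (7951, 9503),
    "Frozen": (3072, 4855),
    "Detergents_Paper": (2881, 4768),
    "Delicatessen": (1525, 2820),
}

# Coefficient of variation: spread relative to the average spend.
cv = {item: std / mean for item, (mean, std) in summary.items()}

most_inconsistent = max(cv, key=cv.get)
least_inconsistent = min(cv, key=cv.get)

for item, value in sorted(cv.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{item:17s} {value:.3f}")
print("most:", most_inconsistent, "least:", least_inconsistent)
```

On these rounded figures, Delicatessen has the highest coefficient of variation (about 1.85) and Fresh the lowest (about 1.05).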
Study -2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the undergraduate
students that attend CMSU. CMSU creates and distributes a survey of 14 questions and receives responses from 62
undergraduates (stored in the Survey data set).
EDA:
Observations:
Dataset has 14 variables in it
1. 6 categorical variables - Gender, Class, Major, Grad Intent, Employment and Computer.
2. 5 integer data type - Age, Social Networking, Satisfaction, Spending and Text Messages.
3. 2 float data type - GPA and Salary
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.2.1. What is the probability that a randomly selected CMSU student will be male?
2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:
2.3.1. Find the conditional probability of different majors among the male students in CMSU.
2.4. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:
2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.
2.4.2 Find the probability that a randomly selected student is a female and does NOT have a laptop.
2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:
2.5.1. Find the probability that a randomly chosen student is either a male or has full-time employment?
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in international
business or management.
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The Undecided students are
not considered now, and the table is a 2x2 table. Do you think the graduate intention and being female are independent
events?
Total number of students = 40
Number of females = 20
Number of students with Graduation Intent Yes = 28
Number of females with Graduation Intent Yes = 11
Let A = Graduation Intent Yes and B = Female.
P(B) = Probability female = 20/40
P(A) = Probability student Graduation Intent Yes = 28/40
P(A)*P(B) = 28/80 = 0.35
P(A|B) = Probability Graduation Intent Yes | Female = 11/20
P(A and B) = P(A|B)*P(B) = 11/40 = 0.275
If A and B were independent, P(A and B) would equal P(A)*P(B). Since 0.275 is not equal to 0.35, the events are dependent.
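The independence check compares the joint probability with the product of the two marginal probabilities. A minimal sketch with exact fractions, using only the four counts listed above (the variable names are illustrative):

```python
from fractions import Fraction

# Counts from the 2x2 table (Undecided students excluded).
total = 40
females = 20
grad_yes = 28
female_grad_yes = 11

p_female = Fraction(females, total)          # P(B)
p_grad = Fraction(grad_yes, total)           # P(A)
p_joint = Fraction(female_grad_yes, total)   # P(A and B)
p_product = p_grad * p_female                # P(A) * P(B)

# Independence holds only if the joint probability equals the product.
independent = p_joint == p_product

print(float(p_joint), float(p_product), independent)
```

Working with Fraction avoids floating-point rounding in the equality test.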
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text Messages.
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the conditional probability
that a randomly selected female earns 50 or more.
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text Messages.
For each of them comment whether they follow a normal distribution. Write a note summarizing your conclusions.
The box plots suggest that all the numerical continuous variables are almost normally distributed, as shown below. Also,
the mean and median of each variable are almost equal, which is further evidence that they are approximately normally
distributed. In addition, all the variables satisfy the empirical rule: about 99.7 % of the observations lie within (µ-3σ, µ+3σ).
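An empirical-rule check of this kind can be sketched as below. The sample values are hypothetical and stand in for one survey column; they are not taken from the Survey data set, and the helper name share_within is illustrative. Note that a high share within three standard deviations is expected for a normal variable but does not by itself establish normality.

```python
import statistics


def share_within(values, k):
    """Fraction of observations inside (mean - k*std, mean + k*std)."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation
    inside = [v for v in values if abs(v - mu) < k * sigma]
    return len(inside) / len(values)


# Hypothetical GPA-like values, for illustration only.
sample = [2.5, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4, 3.6, 3.9]

for k in (1, 2, 3):
    print(k, round(share_within(sample, k), 3))
```

For a normal distribution the three shares would be close to 68 %, 95 % and 99.7 %.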
Study -3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount of moisture the
shingles contain when they are packaged. Customers may feel that they have purchased a product lacking in quality if they
find moisture and wet shingles inside the packaging. In some cases, excessive moisture can cause the granules attached
to the shingles for texture and coloring purposes to fall off the shingles resulting in appearance problems. To monitor the
amount of moisture present, the company conducts moisture tests. A shingle is weighed and then dried. The shingle is
then reweighed and based on the amount of moisture taken out of the product, the pounds of moisture per 100 square
feet are calculated. The company would like to show that the mean moisture content is less than 0.35 pound per 100
square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles and 31 for B shingles.
EDA:
The data set contains float type data in two variables, A and B.
The descriptive statistics of the data are shown below. The standard deviations of the two samples are in the same range
(about 0.136 and 0.137), whereas the means are different (about 0.317 and 0.274). The number of observations is 36 in
column A and 31 in column B. No null values are observed within these observations.
A B
count 36 31
mean 0.316667 0.273548
std 0.135731 0.137296
min 0.13 0.1
25% 0.2075 0.16
50% 0.29 0.23
75% 0.3925 0.4
max 0.72 0.58
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 36 non-null float64
1 B 31 non-null float64
dtypes: float64(2)
memory usage: 704.0 bytes
Q.3.1 Do you think there is evidence that mean moisture contents in both types of shingles are within the permissible
limits? State your conclusions clearly showing all steps.
Sample A
𝐻0: 𝜇A >= 0.35
𝐻𝐴: 𝜇A < 0.35
The company would like to show that the mean moisture content is less than 0.35, so this claim is placed in the alternative
hypothesis and its complement forms the null hypothesis.
The t-statistic and p-value of the one-sample, one-sided (lower-tail) t-test are
t-statistic: -1.4735046253382782 p-value: 0.07477633144907513
Step 4: Decide to reject or fail to reject the null hypothesis
Since the p-value (0.0748) is greater than 0.05, the null hypothesis cannot be rejected. Hence, there is not enough
evidence to conclude that the mean moisture content of shingles A is less than 0.35 pound per 100 square feet, at the 0.05
significance level.
Sample B
Population standard deviation is unknown. Though both sample sizes are more than 30, they are in the border range, so it
is better to opt for the t-test.
𝐻0: 𝜇B >= 0.35
𝐻𝐴: 𝜇B < 0.35
As before, the claim that the mean moisture content is less than 0.35 is placed in the alternative hypothesis.
The t-statistic and p-value of the one-sample, one-sided (lower-tail) t-test are
t-statistic: -3.1003313069986995 p-value: 0.0020904774003191826
Step 4: Decide to reject or fail to reject the null hypothesis
Since the p-value (0.0021) is less than 0.05, the null hypothesis is rejected. Hence, there is evidence that the mean
moisture content of shingles B is less than 0.35 pound per 100 square feet, at the 0.05 significance level.
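The reported t-statistics and p-values can be reproduced approximately from the summary statistics in the EDA. The sketch below is an illustration under stated assumptions: it uses the rounded mean, std and count from the descriptive table rather than the raw data, and it obtains the lower-tail probability P(T <= t) by numerically integrating the Student t density with the standard library (in a notebook, scipy.stats.t.cdf would be the usual route). The helper names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and the symmetry of the density."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def one_sample_lower_tail(mean, std, n, mu0):
    """t-statistic and lower-tail p-value for a one-sample t-test against mu0."""
    t = (mean - mu0) / (std / math.sqrt(n))
    return t, t_cdf(t, n - 1)


# Rounded summary statistics from the descriptive table.
t_a, p_a = one_sample_lower_tail(0.316667, 0.135731, 36, 0.35)
t_b, p_b = one_sample_lower_tail(0.273548, 0.137296, 31, 0.35)

print(f"A: t = {t_a:.4f}, lower-tail p = {p_a:.4f}")
print(f"B: t = {t_b:.4f}, lower-tail p = {p_b:.4f}")
```

Both reported p-values correspond to the lower tail, that is, the probability of a t-statistic at least as negative as the one observed.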
Q.3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis and conduct the test
of the hypothesis. What assumption do you need to check before the test for equality of means is performed?
To perform a test for comparison of means, the samples should adhere to certain assumptions, as stated below:
a) The variances of the two samples should be similar – this is satisfied, as shown in the EDA
b) The samples should be random and both populations should be normally distributed – population data are not
available, but since each sample size is more than 30, the distribution of the sample mean is approximately normal.
The same has been confirmed using empirical rule analysis, as shown in the Jupyter notebook
c) Outliers in the data should be minimal – this is shown in the figure below, where outliers are minimal
Based on the above assumptions, a two-sample independent t-test can be performed on the data samples.
In testing whether the population means for shingles A and B are equal, the null hypothesis states that the population
means of shingles A and shingles B are the same. The alternative hypothesis states that the population means of
shingles A and shingles B are different. We are going to use a two-tailed t-test.
H0: 𝜇A - 𝜇B = 0 i.e. 𝜇A = 𝜇B
HA: 𝜇A - 𝜇B ≠ 0 i.e. 𝜇A ≠ 𝜇B
Level of significance: 0.05. The two-sample t-test p-value = 0.2017496571835328. Since the p-value is greater than the
level of significance, we fail to reject the null hypothesis.
The results indicate that there is no statistically significant difference between the population averages.
At the 0.05 significance level, there is not sufficient evidence to conclude that the mean moisture content in A differs
from the mean moisture content in B, so the null hypothesis of equality of means is not rejected.
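The two-sample p-value can be reproduced approximately from the summary statistics, assuming the pooled-variance (equal variance) form of the independent t-test, in line with assumption (a). This is a sketch under that assumption and with the rounded values from the descriptive table; the helper names are hypothetical, and the t distribution is integrated numerically with the standard library instead of calling a statistics package.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule for the area between 0 and |t|."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


def pooled_t_test(mean1, std1, n1, mean2, std2, n2):
    """Pooled-variance two-sample t-test from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se
    return t, df, two_sided_p(t, df)


# Rounded summary statistics for A and B from the descriptive table.
t_stat, dof, p_value = pooled_t_test(0.316667, 0.135731, 36, 0.273548, 0.137296, 31)

print(f"t = {t_stat:.4f}, df = {dof}, two-sided p = {p_value:.4f}")
```

With 65 degrees of freedom the t-statistic is about 1.29, and the two-sided p-value agrees with the reported value of about 0.2017.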