SMDM Report

From the given document: - The data consists of annual spending amounts of 440 large retailers on 6 product categories across 3 regions and 2 sales channels in Portugal. - Exploratory data analysis found the data to have no null values, right skewed distributions, and many outliers. Fresh items showed the most consistent behavior while Delicatessen the least. - Based on the analysis, the recommendations are to focus on high demand areas like Fresh items in hotels and Grocery in retail, and to increase stock of in-demand Fresh items in other regions since it has the highest expenditures overall.

Uploaded by

Ruhee's Kitchen

Study -1 Project

A wholesale distributor operating in different regions of Portugal has information on the annual spending on several items in
its stores across different regions and channels. The data consists of 440 large retailers’ annual spending on 6
different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across 2 sales channels (Hotel,
Retail).

Q.1.1: Exploratory data analysis:

Buyer/Spender  Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicatessen
1              Retail   Other   12669  9656  7561     214     2674              1338
2              Retail   Other   7057   9810  9568     1762    3293              1776
3              Retail   Other   6353   8808  7684     2405    3516              7844
4              Hotel    Other   13265  1196  4221     6404    507               1788
5              Retail   Other   22615  5410  7198     3915    1777              5185

From the above table, the data set has 6 different types of items. Buyer/Spender (an identifier), Region and Channel are
categorical data, whereas the remaining six columns are int type data measuring the amount spent on each type of item.

                  count  unique  top    freq  mean   std     min  25%    50%   75%    max
Buyer/Spender     440    NaN     NaN    NaN   220.5  127.16  1    110.8  221   330.3  440
Channel           440    2       Hotel  298   NaN    NaN     NaN  NaN    NaN   NaN    NaN
Region            440    3       Other  316   NaN    NaN     NaN  NaN    NaN   NaN    NaN
Fresh             440    NaN     NaN    NaN   12000  12647   3    3128   8504  16934  112151
Milk              440    NaN     NaN    NaN   5796   7380    55   1533   3627  7190   73498
Grocery           440    NaN     NaN    NaN   7951   9503    3    2153   4756  10656  92780
Frozen            440    NaN     NaN    NaN   3072   4855    25   742    1526  3554   60869
Detergents_Paper  440    NaN     NaN    NaN   2881   4768    3    257    817   3922   40827
Delicatessen      440    NaN     NaN    NaN   1525   2820    3    408    966   1820   47943

1) The Region column has 3 unique values, of which Other has the most entries (316). The Channel column has 2
unique values, of which Hotel has more entries (298).
2) For all 6 items, the standard deviation is greater than the mean.
3) All 6 items have a significant number of outliers, as suggested by the gap between the maximum and the 75% quartile
value. It can be further inferred that the distributions are right skewed, since for all six items the mean is greater
than the median, which is an indication of right skewness.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Buyer/Spender 440 non-null bool
1 Channel 440 non-null bool
2 Region 440 non-null bool
3 Fresh 440 non-null bool
4 Milk 440 non-null bool
5 Grocery 440 non-null bool
6 Frozen 440 non-null bool
7 Detergents_Paper 440 non-null bool
8 Delicatessen 440 non-null bool
dtypes: bool(9)
memory usage: 4.0 KB
From the above output, every column has 440 non-null entries, so no null values are present in the provided data set.

Channel  Region  Fresh    Milk     Grocery  Frozen  Detergents_Paper  Delicatessen  Total Spend
Hotel    Lisbon  761233   228342   237542   184512  56081             70632         1538342
Hotel    Oporto  326215   64519    123074   160861  13516             30965         719150
Hotel    Other   2928269  735753   820101   771606  165990            320358        5742077
Retail   Lisbon  93600    194112   332495   46514   148055            33695         848471
Retail   Oporto  138506   174625   310200   29271   159795            23541         835938
Retail   Other   1032308  1153006  1675150  158886  724420            191752        4935522

Based on the above table, the Hotel channel has the higher total expenditure and the Retail channel the lower.
Region-wise, Oporto has the minimum expenditure and Other has the maximum. The same can be seen in the figure below.
Among channel and region combinations, Hotel and Other has the maximum expenditure and Hotel and Oporto has the
minimum.
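
The channel and region comparison above can be re-derived from the Total Spend column alone. The following is a minimal sketch that copies the six totals from the table; the names `total_spend` and `totals_by` are illustrative and are not taken from the original notebook.

```python
# Total spend per (channel, region), copied from the "Total Spend" column above.
total_spend = {
    ("Hotel", "Lisbon"): 1538342,
    ("Hotel", "Oporto"): 719150,
    ("Hotel", "Other"): 5742077,
    ("Retail", "Lisbon"): 848471,
    ("Retail", "Oporto"): 835938,
    ("Retail", "Other"): 4935522,
}


def totals_by(position):
    """Sum total spend over one key position: 0 = channel, 1 = region."""
    totals = {}
    for key, spend in total_spend.items():
        totals[key[position]] = totals.get(key[position], 0) + spend
    return totals


by_channel = totals_by(0)
by_region = totals_by(1)
highest_combo = max(total_spend, key=total_spend.get)
lowest_combo = min(total_spend, key=total_spend.get)

print(by_channel)
print(by_region)
print(highest_combo, lowest_combo)
```

With the raw data in a pandas DataFrame, a group-by on Channel and Region would give the same totals; the dictionary version is used here only so the sketch runs without the data file.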

Q.1.2 There are 6 different varieties of items considered. Do all varieties show similar behaviour across Region
and Channel?

From the above table it can be clearly inferred that expenditure on ‘Fresh’ items is visibly higher in the Hotel channel
than in the Retail channel. Also, in the Hotel channel, Fresh items account for the maximum expenditure in every region,
which is not the case in the Retail channel. Grocery items are the major contributors to expenditure in the Retail channel
in every region. ‘Frozen’ items and ‘Delicatessen’ items are the minimum contributors to expenditure in the Retail channel.
In both channels, the Other region accounts for the majority of the expenditure.

It can be seen from all the above figures that the 6 items do not follow a similar trend in both the
channels.

Q.1.3 On the basis of descriptive measure of variability, which item shows the most inconsistent behavior? Which items
show the least inconsistent behavior?

index             count  mean   std    min  25%   50%   75%    max     CV
Fresh             440    12000  12647  3    3128  8504  16934  112151  1
Milk              440    5796   7380   55   1533  3627  7190   73498   1
Grocery           440    7951   9503   3    2153  4756  10656  92780   1
Frozen            440    3072   4855   25   742   1526  3554   60869   2
Detergents_Paper  440    2881   4768   3    257   817   3922   40827   2
Delicatessen      440    1525   2820   3    408   966   1820   47943   2

For this question, the coefficient of variation (CV = standard deviation / mean) has been calculated for each item, since
it is the measure of relative dispersion, or consistency, to use when data with unequal means are compared. The CV column
above is rounded to whole numbers; using the mean and standard deviation from the descriptive statistics, Fresh has the
lowest coefficient of variation (12647 / 12000 ≈ 1.05) and hence is the most consistent of all, and Delicatessen has the
highest (2820 / 1525 ≈ 1.85) and hence is the least consistent of all.
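
The ranking can be checked from the rounded means and standard deviations reported in the descriptive statistics. This is a small sketch under that assumption; because the inputs are rounded, the ratios may differ slightly in later decimals from values computed on the raw data.

```python
# (mean, std) per item, copied from the descriptive statistics in this report.
summary = {
    "Fresh": (12000, 12647),
    "Milk": (5796, 7380),
    "Grocery": (7951, 9503),
    "Frozen": (3072, 4855),
    "Detergents_Paper": (2881, 4768),
    "Delicatessen": (1525, 2820),
}

# Coefficient of variation = standard deviation / mean.
cv = {item: std / mean for item, (mean, std) in summary.items()}

# Print items from most consistent (lowest CV) to least consistent.
for item, value in sorted(cv.items(), key=lambda pair: pair[1]):
    print(f"{item:<17} {value:.2f}")

most_consistent = min(cv, key=cv.get)
least_consistent = max(cv, key=cv.get)
```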

Q.1.4 Are there any outliers in the data?

It can be seen from the above figure that a significant number of outliers are present in the data. The same was
discussed in the first question on the basis of the descriptive statistics.
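
Even without the figure, the quartiles in the descriptive statistics are enough to show that every item has at least one high outlier under the usual box-plot rule (a point above Q3 + 1.5 × IQR). The sketch below assumes that rule; it cannot count the outliers, which would need the raw data.

```python
# (Q1, Q3, max) per item, copied from the descriptive statistics in this report.
quartiles = {
    "Fresh": (3128, 16934, 112151),
    "Milk": (1533, 7190, 73498),
    "Grocery": (2153, 10656, 92780),
    "Frozen": (742, 3554, 60869),
    "Detergents_Paper": (257, 3922, 40827),
    "Delicatessen": (408, 1820, 47943),
}


def upper_fence(q1, q3, k=1.5):
    """Upper whisker limit of a box plot: Q3 + k * IQR."""
    return q3 + k * (q3 - q1)


has_high_outlier = {}
for item, (q1, q3, maximum) in quartiles.items():
    fence = upper_fence(q1, q3)
    # The maximum lying above the fence proves at least one outlier exists.
    has_high_outlier[item] = maximum > fence
    print(f"{item:<17} fence={fence:>9.1f} max={maximum:>7} outlier={maximum > fence}")
```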

Q.1.5 On the basis of this report, what are the recommendations?


Conclusion:
Based on the given sample data, the wholesale distributor should focus on the observations below:
• Fresh items are more in demand in the Hotel channel.
• Grocery items are more in demand in the Retail channel.
• The Other region has great demand for Fresh items; hence the distributor should increase the stock of Fresh items there.
• Delicatessen items seem to be less in demand in all the regions.

Study -2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the undergraduate
students that attend CMSU. CMSU creates and distributes a survey of 14 questions and receives responses from 62
undergraduates (stored in the Survey data set).

EDA:

Observations:
The dataset has 14 variables in it:
1. 6 categorical variables - Gender, Class, Major, Grad Intent, Employment and Computer.
2. 5 integer data type - Age, Social Networking, Satisfaction, Spending and Text Messages.
3. 2 float data type - GPA and Salary.

2.1. For this data, construct the following contingency tables (Keep Gender as row variable)

2.1.1. Gender and Major

2.1.2. Gender and Grad Intention

2.1.3. Gender and Employment

2.1.4. Gender and Computer


2.2. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:

2.2.1. What is the probability that a randomly selected CMSU student will be male?

Total number of students = 62
Number of male students = 29
Probability that a randomly selected CMSU student will be male = 29/62
P(Male) = 0.468

2.2.2. What is the probability that a randomly selected CMSU student will be female?

Total number of students = 62
Number of female students = 33
Probability that a randomly selected CMSU student will be female = 33/62
P(Female) = 0.532

2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:

Total number of students = 62


Number of males = 29
Number of females = 33

2.3.1. Find the conditional probability of different majors among the male students in CMSU.

• P(Accounting | Male) = 13.8%
• P(CIS | Male) = 3.4%
• P(Economics_Finance | Male) = 13.8%
• P(International_Business | Male) = 6.9%
• P(Management | Male) = 20.7%
• P(Other | Male) = 13.8%
• P(Retailing_Marketing | Male) = 17.2%
• P(Undecided | Male) = 10.3%

2.3.2 Find the conditional probability of different majors among the female students of CMSU.

• P(Accounting | Female) = 9.1%
• P(CIS | Female) = 9.1%
• P(Economics_Finance | Female) = 21.2%
• P(International_Business | Female) = 12.1%
• P(Management | Female) = 12.1%
• P(Other | Female) = 9.1%
• P(Retailing_Marketing | Female) = 27.3%
• P(Undecided | Female) = 0.0%
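
Each conditional probability is the count for a major within one gender divided by the size of that gender group. The sketch below illustrates this with counts that are an assumption: they are back-calculated from the reported percentages and the group sizes (29 males, 33 females), not read from the survey file. With the raw survey loaded in pandas, `pd.crosstab` with `normalize='index'` would produce the same row proportions.

```python
# Counts of majors by gender. ASSUMPTION: back-calculated from the reported
# percentages and the group sizes (29 males, 33 females), not read from the data.
majors = [
    "Accounting", "CIS", "Economics_Finance", "International_Business",
    "Management", "Other", "Retailing_Marketing", "Undecided",
]
counts = {
    "Male": [4, 1, 4, 2, 6, 4, 5, 3],
    "Female": [3, 3, 7, 4, 4, 3, 9, 0],
}


def conditional_percentages(gender):
    """Return P(major | gender) in percent, rounded to one decimal."""
    row = counts[gender]
    total = sum(row)
    return {major: round(100 * n / total, 1) for major, n in zip(majors, row)}


male_pct = conditional_percentages("Male")
female_pct = conditional_percentages("Female")

for major in majors:
    print(f"{major:<23} male={male_pct[major]:>5}  female={female_pct[major]:>5}")
```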

2.4. Assume that the sample is a representative of the population of CMSU. Based on the data, answer the following
question:

2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.

Probability of male and intends to graduate is 27.4%

2.4.2 Find the probability that a randomly selected student is a female and does NOT have a laptop.

Probability of female with no laptop is 6.5%

2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:

2.5.1. Find the probability that a randomly chosen student is either a male or has full-time employment?

Probability of either male or fully employed is 74.2%

2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in international
business or management.

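Probability of a female majoring in international business or management = 12.1% + 12.1% = 24.2% (a student has one major, so the two conditional probabilities from 2.3.2 add)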

2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The Undecided students are
not considered now, and the table is a 2x2 table. Do you think the graduate intention and being female are independent
events?
Total number of students = 40
Number of females = 20
Number of students with Graduation Intent Yes = 28
Number of females with Graduation Intent Yes = 11

Let A = Graduation Intent Yes and B = Female.
P(B) = 20/40 = 0.5
P(A) = 28/40 = 0.7
P(A)*P(B) = 28/80 = 0.35
P(A|B) = 11/20 = 0.55
P(A and B) = P(A|B)*P(B) = 11/40 = 0.275
Since P(A and B) = 0.275 is not equal to P(A)*P(B) = 0.35, the events are dependent.
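
The independence check compares the joint probability with the product of the marginal probabilities. A minimal sketch using exact fractions and the four counts stated above follows; the variable names are illustrative. It only reproduces the arithmetic, and whether the gap is larger than sampling noise would need a formal test such as chi-square, which the report does not perform.

```python
from fractions import Fraction

# Counts from the 2x2 table (Undecided students excluded).
total = 40
females = 20
intent_yes = 28
female_intent_yes = 11

p_female = Fraction(females, total)            # P(B)
p_yes = Fraction(intent_yes, total)            # P(A)
p_joint = Fraction(female_intent_yes, total)   # P(A and B)

# Under independence, P(A and B) would equal P(A) * P(B).
product = p_yes * p_female
independent = p_joint == product

print(float(p_joint), float(product), independent)
```
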
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text Messages.

Answer the following questions based on the data

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?

Total number of students = 62
Number of students with GPA less than 3 = 17
Probability that a student’s GPA is less than 3 = 17/62 = 0.274

2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the conditional probability
that a randomly selected female earns 50 or more.

Probability of a male earning 50 or more = 14/29 = 48.3%
Probability of a female earning 50 or more = 18/33 = 54.5%

2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text Messages.
For each of them comment whether they follow a normal distribution. Write a note summarizing your conclusions.

The box plots of all four numerical variables appear to be approximately normally distributed, as shown below. Also, the
mean and median of each variable are almost equal, which is further evidence that they are approximately normally
distributed. In addition, all the variables satisfy the empirical rule, with about 99.7% of observations falling within (µ-3σ, µ+3σ).
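
The empirical rule check can be sketched as a small function that reports the share of values within 1, 2 and 3 standard deviations of the mean, to be compared with roughly 68%, 95% and 99.7% for a normal distribution. The function name and the sample values are hypothetical and are not the survey data. Passing the 3σ band alone is a weak check for small samples, so the 1σ and 2σ shares are reported as well.

```python
from statistics import mean, pstdev


def empirical_rule_coverage(values):
    """Fraction of values within 1, 2 and 3 standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    coverage = {}
    for k in (1, 2, 3):
        inside = sum(1 for v in values if abs(v - mu) <= k * sigma)
        coverage[k] = inside / len(values)
    return coverage


# Hypothetical GPA-like values, for illustration only (not the survey data).
sample = [2.3, 2.6, 2.8, 2.9, 3.0, 3.0, 3.1, 3.1, 3.2, 3.3, 3.4, 3.6, 3.9]
print(empirical_rule_coverage(sample))
```
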
Study -3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount of moisture the
shingles contain when they are packaged. Customers may feel that they have purchased a product lacking in quality if they
find moisture and wet shingles inside the packaging. In some cases, excessive moisture can cause the granules attached
to the shingles for texture and coloring purposes to fall off the shingles resulting in appearance problems. To monitor the
amount of moisture present, the company conducts moisture tests. A shingle is weighed and then dried. The shingle is
then reweighed and based on the amount of moisture taken out of the product, the pounds of moisture per 100 square
feet are calculated. The company would like to show that the mean moisture content is less than 0.35 pound per 100
square feet.

The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles and 31 for B shingles.

EDA:

The data set contains float type data in two variables, A and B.

The descriptive statistics of the data are shown below. The standard deviations of the two samples are in the same
range, however the means are different. Column A has 36 observations and column B has 31, so the 5 empty entries in
column B reflect the smaller sample rather than missing measurements.

A B
count 36 31
mean 0.316667 0.273548
std 0.135731 0.137296
min 0.13 0.1
25% 0.2075 0.16
50% 0.29 0.23
75% 0.3925 0.4
max 0.72 0.58
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 36 non-null float64
1 B 31 non-null float64
dtypes: float64(2)
memory usage: 704.0 bytes
Q.3.1 Do you think there is evidence that mean moisture contents in both types of shingles are within the permissible
limits? State your conclusions clearly showing all steps.

Let 𝜇A be the population mean moisture content of shingles A
and 𝜇B be the population mean moisture content of shingles B.

Sample A

Step 1: Defining null and alternative hypotheses

The population standard deviation is unknown. Though the sample size is more than 30, it is better to opt for a t-test since
it is in the border range.

For Sample A
𝐻0: 𝜇A >= 0.35
𝐻𝐴: 𝜇A < 0.35
Since the company would like to show that the mean moisture content is less than 0.35, that claim is chosen as the
alternative hypothesis. This is the lower-tailed test that the p-value below corresponds to.

Step 2: Deciding the significance level

Since 𝛼 is not given, we select 𝛼 (significance level) = 0.05.

The sample size for sample A is 36.
The degrees of freedom for sample A are 35.

Step 3: Calculate the p-value and test statistic

The t-statistic and p-value for the one-sample, one-sided t-test are
t-statistic: -1.4735046253382782 p-value: 0.07477633144907513

Step 4: Decide whether to reject or fail to reject the null hypothesis

The p-value is 0.07477633144907513, which is greater than the 5% level of significance.

So, the statistical decision is to fail to reject the null hypothesis at the 5% level of significance.

Hence, there is not enough evidence to conclude that the mean moisture content of shingles A is less than 0.35 pound per
100 square feet, at the 0.05 significance level.

Sample B

Step 1: Defining null and alternative hypotheses

The population standard deviation is unknown. Though the sample size is more than 30, it is better to opt for a t-test since
it is in the border range.

For Sample B
𝐻0: 𝜇B >= 0.35
𝐻𝐴: 𝜇B < 0.35
Since the company would like to show that the mean moisture content is less than 0.35, that claim is chosen as the
alternative hypothesis. This is the lower-tailed test that the p-value below corresponds to.

Step 2: Deciding the significance level

Since 𝛼 is not given, we select 𝛼 (significance level) = 0.05.

The sample size for sample B is 31.
The degrees of freedom for sample B are 30.

Step 3: Calculate the p-value and test statistic

The t-statistic and p-value for the one-sample, one-sided t-test are
t-statistic: -3.1003313069986995 p-value: 0.0020904774003191826

Step 4: Decide whether to reject or fail to reject the null hypothesis

The p-value is 0.0020904774003191826, which is less than the 5% level of significance.

So, the statistical decision is to reject the null hypothesis at the 5% level of significance.

Hence, there is sufficient evidence to conclude that the mean moisture content of shingles B is less than 0.35 pound per
100 square feet, at the 0.05 significance level.
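
Both one-sample results can be approximately reproduced from the summary statistics in the EDA table. The sketch below is not the notebook's code: it assumes a lower-tailed test (alternative: mean below 0.35), takes the rounded means and standard deviations as inputs, and integrates the t density numerically with Simpson's rule so that it needs only the standard library. All function names are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_lower_tail(t, df, steps=2000):
    """P(T <= t), via Simpson's rule on [0, |t|] and the symmetry of the density."""
    a = abs(t)
    h = a / steps
    area = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 0.5 - area if t < 0 else 0.5 + area


def one_sample_lower_tailed(mean, std, n, mu0):
    """t-statistic and p-value for H0: mu >= mu0 against HA: mu < mu0."""
    t = (mean - mu0) / (std / math.sqrt(n))
    return t, t_lower_tail(t, n - 1)


# Summary statistics copied from the EDA table (rounded, so results are approximate).
t_a, p_a = one_sample_lower_tailed(0.316667, 0.135731, 36, 0.35)
t_b, p_b = one_sample_lower_tailed(0.273548, 0.137296, 31, 0.35)
print(f"A: t = {t_a:.4f}, p = {p_a:.4f}, reject H0: {p_a < 0.05}")
print(f"B: t = {t_b:.4f}, p = {p_b:.4f}, reject H0: {p_b < 0.05}")
```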

Q.3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis and conduct the test
of the hypothesis. What assumption do you need to check before the test for equality of means is performed?

To perform a test for comparison of means, the samples should adhere to certain assumptions, as stated below:

a) The variances of the two samples should be similar – this is satisfied, as shown in the EDA
b) The samples should be random and both populations should be normally distributed – the population data is not
available, but since each sample size is more than 30, the distribution of the sample mean is approximately normal.
The same has been confirmed using empirical rule analysis as shown in the Jupyter notebook
c) Outliers in the data should be minimal – as shown in the figure below, outliers are minimal

Based on the above assumptions, a two-sample independent t-test can be performed on the data samples.

Step 1: Define null and alternative hypotheses

In testing whether the population means for shingles A and B are equal, the null hypothesis states that the population
means of shingles A and shingles B are the same (𝜇A = 𝜇B). The alternative hypothesis states that the population means of
shingles A and shingles B are different (𝜇A ≠ 𝜇B). We are going to use a two-tailed t-test.

H0: 𝜇A - 𝜇B = 0 i.e. 𝜇A = 𝜇B
HA: 𝜇A - 𝜇B ≠ 0 i.e. 𝜇A ≠ 𝜇B

Step 2: Defining the significance level

Since 𝛼 is not given, we select 𝛼 (significance level) = 0.05.

The sample sizes of the two samples are not the same.
Degrees of freedom of the test = 36+31-2 = 65

Step 3: Calculating the p - value and test statistic

Two sample t test


t-statistic: 1.2896282719661123 p-value: 0.2017496571835306
Step 4: Decision to reject or accept null hypothesis

Level of significance: 0.05. Our two-sample t-test p-value = 0.2017496571835306. Since the p-value is greater than the
level of significance, we fail to reject the null hypothesis.

The results indicate that, at the 5% level of significance, there is not sufficient evidence to conclude that the mean
moisture content in A differs from the mean moisture content in B. The data are consistent with equal population means,
although failing to reject the null hypothesis does not prove that the means are equal.
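
The two-sample result can be approximately reproduced from the summary statistics as well. The sketch below assumes the pooled-variance (equal variance) form of the independent t-test, which matches the 65 degrees of freedom stated above, and again integrates the t density with Simpson's rule instead of relying on a statistics library. Function names are illustrative, and the inputs are the rounded values from the EDA table.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value 2 * P(T >= |t|), via Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps
    area = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 2 * (0.5 - area)


def pooled_two_sample(mean1, std1, n1, mean2, std2, n2):
    """Pooled-variance t-statistic, degrees of freedom and two-sided p-value."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se
    return t, df, t_two_sided_p(t, df)


# Summary statistics copied from the EDA table (rounded, so results are approximate).
t_stat, dof, p_value = pooled_two_sample(0.316667, 0.135731, 36, 0.273548, 0.137296, 31)
print(f"t = {t_stat:.4f}, df = {dof}, p = {p_value:.4f}, reject H0: {p_value < 0.05}")
```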
