Data and Summarization
Shovan Chowdhury
17-12-2020 EPGP 13
Course Objective
• Familiarity with different types of data and their visualization
• Understanding presence of intrinsic uncertainty in a business
situation
• Use of appropriate statistical techniques for modelling data and
capturing uncertainty
• Applying statistical software for data analysis
• Interpreting outputs from a managerial perspective (may require
knowledge of other disciplines)
• Developing basic expertise in the subject as a foundation for understanding
other areas
“Statistical Techniques/Methods”
Do some statistical calculations → Interpret results
DATA AND SUMMARIZATION
Cola Exclusivity Agreement
A large university with a total enrollment of about 50,000
students has offered one Cola company (Soft) an exclusivity
agreement that would give the company exclusive rights to
sell its products at all university facilities for the next year
with an option for future years.
Cola Exclusivity Agreement
The market for soft drinks is measured in terms of 200 ml bottles.
Cola Exclusivity Agreement
A quick analysis reveals that if its current market share were
25%, then, with an exclusivity agreement, Soft would sell
88,000 bottles per week (its current 22,000 is 25% of 88,000), or
3,520,000 bottles per year (88,000 × 40 weeks).
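The arithmetic on this slide can be checked with a short sketch. The function and variable names are illustrative, and the 40 selling weeks per year is an assumption inferred from 3,520,000 / 88,000, since the slide does not state it.

```python
def exclusive_weekly_sales(current_weekly_sales, market_share):
    """Weekly sales if the company captured the whole market."""
    return current_weekly_sales / market_share

# Figures from the slide: 22,000 bottles per week at a 25% share.
weekly = exclusive_weekly_sales(22_000, 0.25)

# Assumption: 40 selling weeks per year (3,520,000 / 88,000).
WEEKS_PER_YEAR = 40
annual = weekly * WEEKS_PER_YEAR

print(f"{weekly:,.0f} bottles per week, {annual:,.0f} bottles per year")
```

The calculation is only as good as the 25% share, which is exactly the figure the survey is meant to supply.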
Cola Exclusivity Agreement
Soft assigned a recent university graduate to survey the
university's students to supply the missing information.
Inferential statistics
The information we would like to acquire is an estimate of
annual profits from the exclusivity agreement. The data are
the numbers of bottles of soft drinks consumed in 7 days by
the 500 students in the sample.
Inferential statistics
Inferential statistics is a body of methods used to draw
conclusions or inferences about characteristics of populations
based on sample data. The population in question in this case
is the soft drink consumption of the university's 50,000
students. The cost of interviewing each student would be
prohibitive and extremely time consuming. Statistical
techniques make such endeavors unnecessary. Instead, we
can sample a much smaller number of students (the sample
size is 500) and infer from the data the number of soft drinks
consumed by all 50,000 students. We can then estimate
annual profits for the cola company.
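As a rough sketch of the inference step, the sample mean can be scaled up to the whole student body. The survey responses are not reproduced on the slides, so the eight values below are hypothetical placeholders, and `estimate_population_total` is an illustrative name rather than a method taken from the course.

```python
from statistics import mean

def estimate_population_total(sample, population_size):
    """Scale the sample mean up to the whole population."""
    return mean(sample) * population_size

# Hypothetical bottles consumed in 7 days by a handful of students;
# the real survey has 500 responses that are not shown here.
sample = [0, 2, 1, 3, 0, 4, 2, 0]

weekly_total = estimate_population_total(sample, 50_000)
print(f"Estimated weekly consumption: {weekly_total:,.0f} bottles")
```

A point estimate like this says nothing about its own precision; quantifying that uncertainty is the job of the inferential methods covered later.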
Primary Uses of Statistics
Problem
A chocolate manufacturing company sells quality chocolate products at its plant
and retail stores. Two years ago, the company developed a Web site and began
selling its products over the Internet. Web site sales have exceeded the company’s
expectations, and management is now considering strategies to increase sales even
further. To learn more about the Web site customers, a sample of 50
transactions was selected from the previous month’s sales.
The data show, for each of the 50 transactions:
• the day of the week the transaction was made,
• the type of browser the customer used,
• the time spent on the Web site,
• the number of Web site pages viewed,
• the amount spent by the customer.
The Cab Case (Textbook): Demand-Supply Gap
The Car Mileage Case: Estimating Mileage
The Car Mileage Case: The Data
Basic Vocabulary of Statistics
POPULATION
A population consists of all the items or individuals about which
you want to draw a conclusion.
SAMPLE
A sample is the portion of a population selected for analysis.
PARAMETER
A parameter is a numerical measure that describes a
characteristic of a population.
STATISTIC
A statistic is a numerical measure that describes a characteristic
of a sample.
• Quantitative
   - Discrete (no. of customers, no. of claims)
   - Continuous (salary, price)
• Qualitative (Categorical)
   - Ordinal (customer satisfaction, efficiency of workers, bond rating)
   - Nominal (sex, nationality, eye color)
Cross-Sectional Data
Data Visualization
Some quick questions
- Return on investment
- Project completion time
- Mutual fund ratings
- Political affiliation
- Demand for a product
- No. of customers waiting in a queue
- Diameter of bolts
- Number of defectives produced in a shift
- Gender
- No. of misprints per page of a book
- Marital status
- Efficiency of employee
Excel Bar and Pie Chart of Pizza Preference Data
Histogram
Summary for WaitTime
[Figure: histogram of WaitTime (horizontal axis 0 to 12) with 95% confidence intervals for the mean and median]
Anderson-Darling Normality Test
   A-Squared      0.24
   P-Value        0.759
Mean              5.4600
StDev             2.4755
Variance          6.1279
Skewness          0.250415
Kurtosis          -0.404960
N                 100
Minimum           0.4000
1st Quartile      3.8000
Median            5.2500
3rd Quartile      7.2000
Maximum           11.6000
95% Confidence Interval for Mean      4.9688   5.9512
95% Confidence Interval for Median    4.5742   5.8773
95% Confidence Interval for StDev     2.1735   2.8757
Rating Distribution of Sample
[Figure: bar charts of the sample, including the percentage of respondents at each rating (1 to 5) and the percentage by gender (M, F)]
Skewness
• Skewed to left
• Symmetric
• Skewed to right
What is the point? Why collect this data?
Data and randomness
Dispersion
• Describes how similar a set of observations are to each other, or
the degree of deviation (spread) of a set of data from their central
value
• In general, the more spread out a distribution is, the larger the measure of
dispersion will be
Measures of Dispersion
[Figure: chart of scores, vertical axis 0 to 125, horizontal axis 1 to 10]
Interpretation
• The larger the SD/variance is, the more the observations deviate, on
average, away from the mean
• The smaller the SD/variance is, the less the observations deviate, on
average, from the mean
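A small sketch makes the interpretation concrete by comparing two data sets with the same mean but different spread. The numbers are hypothetical, and the sample (n − 1) versions of the variance and SD from Python's `statistics` module are used here as an assumption about which definition is intended.

```python
from statistics import mean, stdev, variance

# Hypothetical data: both sets have a mean of 50.
tight = [48, 49, 50, 51, 52]
wide = [30, 40, 50, 60, 70]

for name, data in (("tight", tight), ("wide", wide)):
    # variance() and stdev() divide by n - 1 (sample versions).
    print(f"{name}: mean={mean(data)}, "
          f"variance={variance(data):.2f}, SD={stdev(data):.2f}")
```

Both sets are centred on 50, yet the second has a much larger SD because its observations sit, on average, further from the mean.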
Coefficient of Variation (CV)
$$CV = \frac{s}{\bar{x}} \times 100$$
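The coefficient of variation divides the sample standard deviation s by the sample mean x̄ and multiplies by 100, which makes spreads comparable across different units. A minimal sketch, using hypothetical prices and salaries and an illustrative function name:

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """CV in percent: sample SD divided by sample mean, times 100."""
    return stdev(data) / mean(data) * 100

# Hypothetical values measured on very different scales.
prices = [10, 12, 14]
salaries = [1000, 1200, 1400]

print(f"CV of prices:   {coefficient_of_variation(prices):.2f}%")
print(f"CV of salaries: {coefficient_of_variation(salaries):.2f}%")
```

The two SDs differ by a factor of 100, but the relative variability is the same, which is the point of using CV rather than SD for such comparisons.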
Percentiles, Quartiles and IQR
Box Plot
Detection of Outliers (Box Plot)
[Figure: box plot of IBM, axis from 83 to 91]
A large number of fast-food restaurants with drive-through
windows offer drivers and their passengers the
advantages of quick service. To measure how good the
service is, an organization called QSR planned a study
in which the amount of time taken by a sample of
drive-through customers at each of five restaurants was
recorded. Compare the five sets of data using box plots
and interpret the results.
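One way to approach the comparison is to compute the numbers behind each box plot. The sketch below uses hypothetical service times for two restaurants (the study's data are not reproduced on the slides), and the 1.5 × IQR fences for flagging outliers are a common convention assumed here, not something stated in the problem.

```python
from statistics import quantiles

def box_plot_summary(times):
    """Five-number summary plus points outside the 1.5 * IQR fences."""
    q1, median, q3 = quantiles(times, n=4)  # quartile cut points
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": min(times), "q1": q1, "median": median,
        "q3": q3, "max": max(times), "iqr": iqr,
        "outliers": [t for t in times if t < low_fence or t > high_fence],
    }

# Hypothetical service times in seconds.
service_times = {
    "Restaurant A": [100, 110, 120, 130, 140, 150, 160, 300],
    "Restaurant B": [150, 155, 160, 165, 170, 175, 180, 185],
}

summaries = {name: box_plot_summary(t) for name, t in service_times.items()}
for name, s in summaries.items():
    print(name, s)
```

Comparing medians shows which restaurant is typically faster, comparing IQRs shows which is more consistent, and the outlier list highlights unusually slow or fast visits. Quartile conventions differ between software packages, so the exact cut points may not match Excel or Minitab.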
Box Plots…
Standardising Data
• Purpose: To compare each data point to the natural
range and variation of the dataset.
• Method: For each data value, subtract the sample
mean and divide by the sample standard deviation.
The resulting numbers are called z-values or z-scores. They:
• measure how many standard deviations above or
below the mean a data point is
• are “unit free”
• have mean zero and SD 1
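The method can be sketched in a few lines. The data values are hypothetical and `z_scores` is an illustrative name; the sample SD (dividing by n − 1) is assumed.

```python
from statistics import mean, stdev

def z_scores(data):
    """Subtract the sample mean, then divide by the sample SD."""
    m, s = mean(data), stdev(data)
    return [(x - m) / s for x in data]

# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(data)

print([round(v, 2) for v in z])
print(f"mean of z = {mean(z):.2f}, SD of z = {stdev(z):.2f}")
```

Whatever units the original data were in, the standardised values have mean 0 and SD 1, so z-scores from different variables can be compared directly.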
Capturing variation
• Chebyshev’s Theorem
   Applies to any distribution, regardless of shape
• Empirical Rule
   Applies only to roughly mound-shaped and symmetric
   distributions
Chebyshev’s Theorem
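• At least $1 - \frac{1}{k^2}$ of the elements of any
distribution lie within k standard deviations of the
mean (for k > 1)

At least the following share lies within k standard deviations of the mean:
k = 2: $1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$
k = 3: $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} = 89\%$
k = 4: $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} = 94\%$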
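The bound for any k can be computed directly. This is a minimal sketch with an illustrative function name; the check for k ≤ 1 is added because the bound carries no information there.

```python
def chebyshev_lower_bound(k):
    """Minimum share of observations within k SDs of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.0%}")
```

These are guaranteed minimums for any distribution, so the share actually observed in a given data set is often considerably higher.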
Empirical Rule
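• For roughly mound-shaped and symmetric
distributions, approximately:
   - 68% of observations lie within μ ± 1σ
   - 95% of observations lie within μ ± 2σ
   - 99.7% of observations lie within μ ± 3σ
[Figure: bell-shaped curve with the x-axis marked at μ − 3σ, μ − 2σ, μ − 1σ, μ, μ + 1σ, μ + 2σ, μ + 3σ]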
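The Empirical Rule percentages can be reproduced from the normal distribution, which is used here as an assumed model of a "mound-shaped and symmetric" distribution. The function name is illustrative.

```python
from statistics import NormalDist

def share_within(k):
    """Share of a normal distribution within k SDs of its mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

Real data only approximate a normal curve, so observed shares will be close to, rather than exactly, 68%, 95%, and 99.7%.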
A survey is conducted on 20 respondents to gather information on customer satisfaction with a product.
Customer satisfaction is recorded on a 3-point scale: highly satisfied (HS), satisfied (S), and not satisfied
(NS). Gender is recorded as male (M) or female (F). The data are shown below:
Respondent          1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20
Gender              M   F   F   F   F   F   F   M   M   M   M   M   F   F   F   F   M   F   M   M
Satisfaction level  S   S   NS  S   NS  NS  NS  NS  HS  HS  S   NS  HS  S   S   HS  NS  NS  S   S
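One natural summary of these two categorical variables is a cross-tabulation of gender against satisfaction level. The sketch below builds it from the 20 recorded responses; choosing a two-way table is a suggestion here, not a step stated on the slide.

```python
from collections import Counter

# Responses for respondents 1 to 20, as recorded on the slide.
gender = "M F F F F F F M M M M M F F F F M F M M".split()
satisfaction = "S S NS S NS NS NS NS HS HS S NS HS S S HS NS NS S S".split()

# Count each (gender, satisfaction) pair to build a two-way table.
counts = Counter(zip(gender, satisfaction))

levels = ["HS", "S", "NS"]
print("Gender", *levels, "Total")
for g in ("M", "F"):
    row = [counts[(g, level)] for level in levels]
    print(g, *row, sum(row))
```

Dividing each cell by its row total gives the satisfaction profile within each gender, which is easier to compare than raw counts when the groups differ in size.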
Scatter Plots and Correlation
Scatter Plot Examples
[Figure: scatter plots of y against x illustrating linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
Correlation Coefficient
Calculating sample Correlation Coefficient
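$$r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y}$$
$$\operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$s_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}$$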
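The calculation can be sketched step by step: covariance first, then the two standard deviations, then their ratio. The sketch assumes the 1/n divisor throughout and a square root in each standard deviation; the data are hypothetical and the function name is illustrative.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation: cov(x, y) divided by s_x * s_y (1/n divisor)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n
    s_x = sqrt(sum((a - x_bar) ** 2 for a in x) / n)
    s_y = sqrt(sum((b - y_bar) ** 2 for b in y) / n)
    return cov / (s_x * s_y)

# Hypothetical data: y rises (or falls) exactly in line with x.
x = [1, 2, 3, 4, 5]
r_up = correlation(x, [2, 4, 6, 8, 10])
r_down = correlation(x, [10, 8, 6, 4, 2])
print(round(r_up, 2), round(r_down, 2))
```

Because the same divisor appears in the numerator and the denominator, using n − 1 instead of n everywhere would give the same value of r.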
Features of correlation coefficient
• Unit free
• Ranges between -1.00 and 1.00
• -1 ≤ r < 0 implies that as X ↑ (↓), Y ↓ (↑)
• 0 < r ≤ 1 implies that as X ↑ (↓), Y ↑ (↓)
• The closer to -1.00, the stronger the negative linear relationship
• The closer to 1.00, the stronger the positive linear relationship
• The closer to 0.00, the weaker the linear relationship
• r=0 implies that X and Y are not linearly associated
Examples of Approximate r Values
[Figure: scatter plots of y against x with r = -1.00, r = -0.60, r = 0.00, r = 0.20, and r = 1.00]