
Week 01

Uploaded by

gibawav948

GE 204/ FENS 200

PROBABILITY & STATISTICS

Week 1: Introduction to
Statistics

1
Dealing with Uncertainty

Everyday decisions are based on incomplete information.

Consider:
• The price of IBM stock will be higher in six months than it is now.
• If the federal budget deficit is as high as predicted, interest rates will remain high for the rest of the year.

2
Dealing with Uncertainty
(continued)

Because of uncertainty, the statements should be modified:
• The price of IBM stock is likely to be higher in six months than it is now.
• If the federal budget deficit is as high as predicted, it is probable that interest rates will remain high for the rest of the year.

3
What does “Statistics” mean?
1. Numerical data
“According to statistics, this year’s exports have set a record!”
“We need to collect statistics on the productivity of this business.”

2. A collection of theories, rules, and techniques
“This company uses statistics in its quality control system.”
“This faculty teaches statistics.”

3. Its meaning for a “statistician”
You will learn this once you excel at statistics.

4
Descriptive and Inferential Statistics

Two branches of statistics:
• Descriptive statistics
  • Collecting, summarizing, and processing data to transform data into information
• Inferential statistics
  • Providing the bases for predictions, forecasts, and estimates that are used to transform information into knowledge
5
Descriptive Statistics

• Collect data
  • e.g., Survey
• Present data
  • e.g., Tables and graphs
• Summarize data
  • e.g., Sample mean $\bar{X} = \frac{\sum X_i}{n}$
6
Inferential Statistics
 Estimation
 e.g., Estimate the population
mean weight using the sample
mean weight
 Hypothesis testing
 e.g., Test the claim that the
population mean weight is 120
pounds

Inference is the process of drawing conclusions or making decisions about a population based on sample results.
7
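The Decision Making Process

Begin here: identify the problem. The process then moves from data up to a decision:
• Data
• Information (obtained from data using descriptive statistics, probability, computers)
• Knowledge (obtained from information using experience, theory, literature, inferential statistics, computers)
• Decision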
8
Key Definitions

• A population is the collection of all items of interest or under investigation
  • N represents the population size
• A sample is an observed subset of the population
  • n represents the sample size
• A parameter is a specific characteristic of a population
• A statistic is a specific characteristic of a sample
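
The four definitions can be illustrated with a minimal Python sketch. The population here is made up (1000 simulated weights), and the names `population`, `sample`, `mu`, and `x_bar` are assumptions for illustration only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: N = 1000 simulated weights (illustrative only).
population = [random.gauss(120, 15) for _ in range(1000)]
N = len(population)

# A sample is an observed subset of the population.
sample = random.sample(population, 50)
n = len(sample)

mu = statistics.mean(population)  # parameter: computed from the whole population
x_bar = statistics.mean(sample)   # statistic: computed from the sample only

print(f"N = {N}, n = {n}, mu = {mu:.2f}, x_bar = {x_bar:.2f}")
```

The sample mean will usually be close to, but not equal to, the population mean, which is why inference is needed.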

9
Population vs. Sample

[Figure: the population is shown as the letters a through z; the sample is a subset of those letters (b, c, g, i, n, o, r, u, y).]

• Values calculated using population data are called parameters.
• Values computed from sample data are called statistics.
10
Examples of Populations

• Names of all registered voters in Turkey
• Incomes of all families living in Fatih/Istanbul
• Annual returns of all stocks traded on the Istanbul Stock Exchange
• Grade point averages of all the students in Kadir Has University

11
Why “Sampling”?
• Less time consuming than a census
• Less costly to administer than a census
• It is possible to obtain statistical results of a sufficiently high precision based on samples.

12
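Process of Statistical Data Analysis

[Figure: a cycle. A random sample is drawn from the population; the sample is described by sample statistics; the statistics are used to make inferences about the population.]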

13
Data Types

Data
• Qualitative (Categorical): defined categories
  • Examples: marital status, political party, eye color
• Quantitative (Numerical)
  • Discrete: counted items
    • Examples: number of children, defects per hour
  • Continuous: measured characteristics
    • Examples: weight, voltage
14
Data Types

• Time Series Data
  • Ordered data values observed over time
• Cross Section Data
  • Data values observed at a fixed point in time

15
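Data Types

Sales (in $1000’s)

            2003   2004   2005   2006
Atlanta      435    460    475    490
Boston       320    345    375    395
Cleveland    405    390    410    395
Denver       260    270    285    280

• Time series data: each row (one city observed from 2003 to 2006)
• Cross section data: each column (all four cities observed in a single year)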
16
Measurement Levels

Quantitative data
• Ratio data: differences between measurements, true zero exists
• Interval data: differences between measurements but no true zero

Qualitative data
• Ordinal data: ordered categories (rankings, order, or scaling)
• Nominal data: categories (no ordering or direction)
17
Measurement Levels-EXAMPLES
• Nominal: sex, eye colour
  • Percentages, frequency, mode (most frequent value)
• Ordinal: socio-economic status: high (A), mid to high (B), low to mid (C), low (D)
  • Very commonly used as a Likert scale (e.g., in surveys, a statement followed by the choices strongly agree, agree, neither agree nor disagree, disagree, strongly disagree)
  • In addition to nominal: median (middle) value, quartiles
• No addition/subtraction/multiplication/division among the levels of nominal/ordinal data
18
Measurement Levels-EXAMPLES
• Interval: temperature, welfare, utility, IQ level
  • In addition to ordinal: mean and variance, but not any ratios (e.g., coefficient of variation)
  • Celsius degrees: 100 units between the freezing (0 °C) and boiling (100 °C) points of water
  • Fahrenheit degrees: 180 units between the freezing (32 °F) and boiling (212 °F) points of water
  • Different reference points (0 °C vs. 32 °F): F = 32 + 1.8 C
  • 40 °C is NOT twice 20 °C, since the same temperatures are also 104 °F and 68 °F.
  • Generally, F1 = 32 + 1.8 C and F2 = 32 + 1.8 (2*C), hence F2 ≠ 2*F1.
19
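
The Celsius/Fahrenheit argument can be checked numerically. This is a small sketch using the conversion F = 32 + 1.8 C from the slide; the function name `c_to_f` is an assumption for illustration.

```python
def c_to_f(c):
    """Convert Celsius to Fahrenheit using F = 32 + 1.8 C."""
    return 32 + 1.8 * c

c1, c2 = 20, 40
f1, f2 = c_to_f(c1), c_to_f(c2)

ratio_c = c2 / c1  # ratio of the two readings in Celsius
ratio_f = f2 / f1  # ratio of the same two temperatures in Fahrenheit

diff_c = c2 - c1   # difference in Celsius
diff_f = f2 - f1   # difference in Fahrenheit

print(f"ratio in C: {ratio_c:.3f}, ratio in F: {ratio_f:.3f}")
print(f"difference in C: {diff_c}, difference in F: {diff_f:.1f}")
```

The ratio changes with the unit, so it carries no meaning on an interval scale. The difference is only rescaled by the constant factor 1.8, which is why differences, means, and variances remain usable.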
Measurement Levels-EXAMPLES
• Ratio: units of kg, meter, TL
  • A reference point is available (e.g., 0, regardless of the unit)
  • In addition to interval: any ratios (e.g., coefficient of variation)
  • 1 km = 0.6214 miles, 1 kg = 2.2046 pounds (there is no constant term in the conversion).
  • It may be discrete or continuous (mostly rounded numbers are used).

20
Measurement Levels-EXERCISE
 Occupation, City
 Education
 Price
 Likeness

21
Descriptive statistics
 Compute and interpret statistics describing the
location of a set of values, such as the mean
and median.
 Compute and interpret statistics describing the
variability in a set of values, such as the range
and standard deviation.
 Compute and interpret the measures of shape,
skewness and kurtosis.
 Produce graphical displays of data.

22
Some Frequently Used Statistics and Parameters

                      SAMPLE STATISTICS    POPULATION PARAMETERS
MEAN                  $\bar{x}$            $\mu$
VARIANCE              $s^2$                $\sigma^2$
STANDARD DEVIATION    $s$                  $\sigma$
PROPORTION            $\hat{p}$            $p$

23
Measure of Location
• Descriptive statistics that locate the center of your data are called measures of central tendency
• Sample Mean
  • The sample mean of a set of n measurements ($x_1, x_2, \dots, x_n$) is equal to the sum of the measurements divided by n.

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
24
Measure of Location
• Median
  • Median: the “middle” value (also known as the 50th percentile)
  • The median of a set of n measurements ($x_1, x_2, \dots, x_n$) is the value that falls in the middle position when the measurements are ordered from the smallest to the largest.

$$\tilde{x} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2} & \text{if } n \text{ is even} \end{cases}$$

where $x_1, \dots, x_n$ are arranged in increasing order of magnitude.
25


RULE FOR CALCULATING THE MEDIAN

1. Order the measurements from the smallest to the largest.
2. A) If the sample size is odd, the median is the middle measurement.
   B) If the sample size is even, the median is the average of the two middle measurements.
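
The rule above can be sketched in a few lines of Python. The function name `median` is chosen for illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Median by the rule above: sort, then take the middle value (odd n)
    or the average of the two middle values (even n)."""
    ordered = sorted(values)  # step 1: order from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:            # step 2A: odd sample size
        return ordered[mid]
    # step 2B: even sample size
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))
print(median([7, 1, 10, 8, 4, 12]))
```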

26
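Odd sample size: 1 3 3 4 5 8 51 (three values below and three values above the middle), Median = 4
Even sample size: 1 3 3 4 5 8 (three values in each half), Median = (3+4)/2 = 3.5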

27
Example
A random sample of six values was taken from a population. These values were:

x1=7, x2=1, x3=10, x4=8, x5=4, and x6=12.

What are the sample mean and sample median for these data?

28
Example (con’t)

$$\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{n} = \frac{7 + 1 + 10 + 8 + 4 + 12}{6} = 7$$

Ordered sample:

x2=1, x5=4, x1=7, x4=8, x3=10, x6=12

MEDIAN = (7 + 8) / 2 = 7.5

29
Example

Consider the following sample:

4 18 36 39 41 42 43 44 44 45
46 47 48 49 49 50 51 53 54 60

Which measure of central tendency best describes the central location of the data: THE SAMPLE MEAN OR SAMPLE MEDIAN? Why?

30
Example (con’t)

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 43.15$$

$$\tilde{x} = \frac{45 + 46}{2} = 45.5$$

Answer: the median

31
Example (con’t)
Why?
Because there is an outlier (extreme value), 4, in the data set, and the mean is heavily influenced by this single outlier.
Solution:
Trimmed mean: drop the outlier and recalculate the mean.

$$\bar{x}_{\text{trim}} = \frac{\left(\sum_{i=1}^{n} x_i\right) - 4}{n - 1} = 45.21$$
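
The effect of the outlier can be reproduced with Python's standard `statistics` module. This is a sketch of the slide's calculation, in which "trimming" simply means dropping the single value 4; the variable names are chosen for illustration.

```python
import statistics

data = [4, 18, 36, 39, 41, 42, 43, 44, 44, 45,
        46, 47, 48, 49, 49, 50, 51, 53, 54, 60]

mean_all = statistics.mean(data)      # pulled down by the outlier
median_all = statistics.median(data)  # barely affected by it

outlier = 4
# Drop the single outlier and recompute the mean, as on the slide.
trimmed_mean = (sum(data) - outlier) / (len(data) - 1)

print(f"mean = {mean_all:.2f}, median = {median_all}, trimmed mean = {trimmed_mean:.2f}")
```

After the outlier is removed, the mean moves close to the median, which is the sense in which the median is the better summary of the original data.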
32
Mode
 A measure of location
 The value that occurs most often
 Not affected by extreme values
 Used for either numerical or categorical data
 There may be no mode
 There may be several modes

[Figure: two dot plots. Left (axis from 0 to 14): Mode = 5. Right (axis from 0 to 6): No Mode.]
33
Mode (con’t)
• What is the mode for the previous example (the sample of 20 values)?
  • 44 (occurs twice)
  • 49 (occurs twice)
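
A quick check of the two modes, sketched with `statistics.multimode` (available in Python 3.8 and later). Note one difference in convention: when no value repeats, `multimode` returns every value, whereas the slides describe that case as "no mode".

```python
import statistics

data = [4, 18, 36, 39, 41, 42, 43, 44, 44, 45,
        46, 47, 48, 49, 49, 50, 51, 53, 54, 60]

# All values tied for the highest frequency, in order of first appearance.
modes = statistics.multimode(data)

# With no repeated value, every value is returned.
no_repeats = statistics.multimode([0, 1, 2, 3, 4, 5, 6])

print(modes)
print(no_repeats)
```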

34
Distributions
• When you examine the distribution of values, you can determine
  • the range of possible data values
  • the frequency of data values
  • whether the data values accumulate in the middle of the distribution or at one end.
• The median, mean, and mode are related to the shape of the distribution.

35
Measures of Central Tendency: Shape of a Distribution
• Describes how data are distributed
• Symmetric or skewed
• The relations below apply to many unimodal distributions (not to multimodal ones)

• Left-skewed (longer tail extends to left): Mean < Median < Mode
• Symmetric: Mode = Mean = Median
• Right-skewed (longer tail extends to right): Mode < Median < Mean
36
Percentiles and Quartiles

Percentiles
The pth percentile in a data array (where 0 ≤ p ≤ 100):
• p% are less than or equal to this value
• (100 – p)% are greater than or equal to this value

Quartiles
• 1st quartile = 25th percentile
• 2nd quartile = 50th percentile = median
• 3rd quartile = 75th percentile

37
Percentiles and Quartiles

Quartiles break your data up into quarters.

Data (sorted from high to low): 98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42
• 75th percentile (third quartile) = 91
• 50th percentile (median) = 80
• 25th percentile (first quartile) = 59
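
The slide does not say which quartile convention it uses. As an assumption, the sketch below takes the first and third quartiles as the medians of the lower and upper halves of the sorted data, which reproduces the slide's values for this data set. Other conventions, including the ones in `statistics.quantiles`, can give slightly different numbers.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1, median, Q3 using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower = ordered[:half]   # values below the middle position
    upper = ordered[-half:]  # values above the middle position
    return median(lower), median(ordered), median(upper)

scores = [98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42]
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3)
```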
38
Weighted Mean
• Used when values are grouped by frequency or relative importance

Example: Sample of 26 Repair Projects

Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2

Weighted Mean Days to Complete:

$$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days}$$
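
The same calculation as a short Python sketch, with the frequencies acting as the weights. The variable names are chosen for illustration.

```python
days = [5, 6, 7, 8]        # x_i: days to complete
frequency = [4, 12, 8, 2]  # w_i: number of projects with that duration

weighted_sum = sum(w * x for w, x in zip(frequency, days))
total_weight = sum(frequency)
weighted_mean = weighted_sum / total_weight

print(f"{weighted_sum} / {total_weight} = {weighted_mean:.2f} days")
```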

39
Measures of Variation

• Measures of variation give information on the spread or variability of the data values.

[Figure: two distributions with the same center but different variation.]
40
The Spread of a Distribution: Variation

Measure                     Definition
range                       the difference between the maximum and minimum data values
interquartile range         the difference between the 25th and 75th percentiles (IR or IQR)
variance                    a measure of dispersion of the data around the mean
standard deviation          a measure of dispersion expressed in the same units of measurement as your data (the square root of the variance)
coefficient of variation    standard deviation as a percentage of the mean

41
Range
 Simplest measure of variation
 Difference between the largest and the
smallest observations:

Range = x_maximum – x_minimum

Example:

[Figure: dot plot on an axis from 0 to 14; the smallest observation is 1 and the largest is 14.]

Range = 14 - 1 = 13
42
Disadvantages of the Range
 Ignores the way in which data are distributed

7 8 9 10 11 12 7 8 9 10 11 12
Range = 12 - 7 = 5 Range = 12 - 7 = 5

 Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
43
Interquartile Range

• Can eliminate some outlier problems by using the interquartile range
• Eliminate some high- and low-valued observations and calculate the range from the remaining values.
• Interquartile range = 3rd quartile – 1st quartile
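
To see the difference in outlier sensitivity, the sketch below compares the range and the interquartile range on the two 25-value data sets from the "Disadvantages of the Range" slide. It uses `statistics.quantiles` with its default method, which is one of several quartile conventions and an assumption here; the helper names are illustrative.

```python
import statistics

base = [1] * 11 + [2] * 8 + [3] * 4 + [4]
without_outlier = base + [5]
with_outlier = base + [120]

def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

def iqr(values):
    """3rd quartile minus 1st quartile (default method of statistics.quantiles)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

for name, values in [("without outlier", without_outlier),
                     ("with outlier", with_outlier)]:
    print(name, "range =", value_range(values), "IQR =", iqr(values))
```

Replacing 5 with 120 changes the range from 4 to 119, while the interquartile range is the same for both data sets.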

44
Variance and Standard Deviation
• The variance is a measure of variation ($\sigma^2$ or $s^2$).
• The square root of the variance, or standard deviation ($\sigma$ or $s$), is a measure of variation in terms of the original linear scale (most commonly used).
  • $\sigma = \sqrt{\sigma^2}$ is the population standard deviation.
  • $s = \sqrt{s^2}$ is the sample standard deviation.

45
Measures of Variability (Population)
• Population Range

$$X_{\text{Max}} - X_{\text{Min}}$$

• Population Variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$$

• Population Standard Deviation

$$\sigma = \sqrt{\sigma^2}$$

46
PROOF
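
The body of this slide did not survive extraction. Assuming it proved that the two forms of the population variance are equal, a standard derivation is sketched below. It expands the square and uses $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$.

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \\
         &= \frac{1}{N}\sum_{i=1}^{N} \left( x_i^2 - 2\mu x_i + \mu^2 \right) \\
         &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 \;-\; 2\mu \cdot \frac{1}{N}\sum_{i=1}^{N} x_i \;+\; \frac{N\mu^2}{N} \\
         &= \frac{\sum_{i=1}^{N} x_i^2}{N} - 2\mu^2 + \mu^2 \\
         &= \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2
\end{aligned}
```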

47
Measures of Variability (Sample)
• Sample Range

$$X_{\text{Max}} - X_{\text{Min}}$$

• Sample Variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$

• Sample Standard Deviation

$$s = \sqrt{s^2}$$
48
Measures of Variability (Sample)

Obs.    $x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$    $x_i^2$
1       7        0                  0                      49
2       1        -6                 36                     1
3       10       3                  9                      100
4       8        1                  1                      64
5       4        -3                 9                      16
6       12       5                  25                     144
Sum     42       0                  80                     374
49
Sample Variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{80}{5} = 16$$

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{374 - \frac{42^2}{6}}{5} = \frac{80}{5} = 16$$
50
Sample Variance
• Calculate the sample variance by averaging with n-1 instead of n.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

• n-1 is called the degrees of freedom associated with the variance estimate. It represents the number of independent pieces of information available for computing variability.
51
Comparing Standard Deviations

Same mean, but different standard deviations:

[Figure: three dot plots, each on an axis from 11 to 21.]
• Data A: Mean = 15.5, s = 3.338
• Data B: Mean = 15.5, s = 0.9258
• Data C: Mean = 15.5, s = 4.57
52
Coefficient of Variation
• Measures relative variation
• Always expressed as a percentage (%)
• Shows variation relative to the mean
• Used to compare two or more sets of data measured in different units

Population: $CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$
Sample: $CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$
53
Comparing Coefficients of Variation
• Stock A:
  • Average price last year = $50
  • Standard deviation = $5

$$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$$

• Stock B:
  • Average price last year = $100
  • Standard deviation = $5

$$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less variable relative to its price.
54
 Presentation of Data
 Tables
 Graphs
 Frequency displays and Histograms
 Stem-leaf display
Stem and Leaf Diagram

• A simple way to see distribution details from quantitative data

METHOD
1. Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)
2. List all stems in a column from low to high
3. For each stem, list all associated leaves
Example:

Data sorted from low to high:

12, 13, 17, 21, 24, 24, 26, 27, 28, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Here, use the 10’s digit for the stem unit:

                    Stem | Leaf
• 12 is shown as       1 | 2
• 35 is shown as       3 | 5
Example:

Data in ordered array:

12, 13, 17, 21, 24, 24, 26, 27, 28, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Completed stem-and-leaf diagram:

Stem | Leaves
   1 | 2 3 7
   2 | 1 4 4 6 7 8
   3 | 0 2 5 7 8
   4 | 1 3 4 6
   5 | 3 8
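
The three-step method can be sketched in Python for two-digit data, using the 10's digit as the stem. The function name `stem_and_leaf` and the `stem | leaves` output format are assumptions for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one 'stem | leaves' line per stem for non-negative integers,
    using the 10's digit (and above) as the stem and the units digit as the leaf."""
    groups = defaultdict(list)
    for v in sorted(values):           # step 1: sort, then split into stem and leaf
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    lines = []
    for stem in range(min(groups), max(groups) + 1):  # step 2: stems from low to high
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())   # step 3: leaves for each stem
    return lines

data = [12, 13, 17, 21, 24, 24, 26, 27, 28, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
for line in stem_and_leaf(data):
    print(line)
```

Stems with no observations are still listed (with no leaves), so gaps in the distribution remain visible.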
Using other stem units
• Using the 100’s digit as the stem:
  • Round off the 10’s digit to form the leaves

                        Stem | Leaf
• 613 would become         6 | 1
• 776 would become         7 | 8
• ...
• 1224 becomes            12 | 2
Construction of a Stem-Leaf Display
 List the stem values, in order, in a vertical column
 Draw a vertical line to the right of the stem values
 For each observation, record the leaf portion of the
observation in the row corresponding to the appropriate
stem
 Reorder the leaves from the lowest to highest within
each stem row
 If the number of leaves appearing in each stem is too
large, divide the stems into two groups, the first
corresponding to leaves 0 through 4, and the second
corresponding to leaves 5 through 9. (This subdivision
can be increased to five groups if necessary).
EXAMPLE: Car Battery Life
2.2 4.1 3.5 4.5 3.2 3.7 3.0 2.6

3.4 1.6 3.1 3.3 3.8 3.1 4.7 3.7

2.5 4.3 3.4 3.6 2.9 3.3 3.9 3.1

3.3 3.1 3.7 4.4 3.2 4.1 1.9 3.4

4.7 3.8 3.2 2.6 3.9 3.0 4.2 3.5


Stem and Leaf Plot of Battery Life

STEM | LEAF                        | Frequency
   1 | 69                          | 2
   2 | 25669                       | 5
   3 | 0011112223334445567778899   | 25
   4 | 11234577                    | 8
Relative Frequency Distribution
• Group data into different classes or intervals
  • Counting leaves belonging to each stem
  • Each stem defines a class interval
• Dividing each class frequency by the total number of observations gives the proportion of the set of observations in each of the classes.
Relative Frequency Distribution of Battery Life

Class Interval    Class midpoint    Frequency, f    Relative frequency
1.5-1.9           1.7               2               0.05
2.0-2.4           2.2               1               0.025
2.5-2.9           2.7               4               0.100
3.0-3.4           3.2               15              0.375
3.5-3.9           ?                 ?               ?
4.0-4.4           ?                 ?               ?
4.5-4.9           ?                 ?               ?
Relative Frequency Distribution of Battery Life (con’t)

Class Interval    Class midpoint    Frequency, f    Relative frequency
1.5-1.9           1.7               2               0.05
2.0-2.4           2.2               1               0.025
2.5-2.9           2.7               4               0.100
3.0-3.4           3.2               15              0.375
3.5-3.9           3.7               10              0.250
4.0-4.4           4.2               5               0.125
4.5-4.9           4.7               3               0.075
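
The table can be rebuilt from the 40 raw battery-life values with a short Python sketch. To avoid floating-point trouble at class boundaries, each value is converted to whole tenths before binning; that choice and the variable names are assumptions for illustration.

```python
battery_life = [
    2.2, 4.1, 3.5, 4.5, 3.2, 3.7, 3.0, 2.6,
    3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7,
    2.5, 4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1,
    3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4,
    4.7, 3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5,
]

# Lower class limits in tenths: 1.5, 2.0, ..., 4.5 (each class spans 5 tenths).
lower_limits = [15, 20, 25, 30, 35, 40, 45]
frequencies = [0] * len(lower_limits)

for value in battery_life:
    tenths = round(value * 10)               # e.g. 3.4 -> 34
    frequencies[(tenths - 15) // 5] += 1     # index of the class holding this value

n = len(battery_life)
relative = [f / n for f in frequencies]      # class frequency / total observations

for low, f, r in zip(lower_limits, frequencies, relative):
    print(f"{low / 10:.1f}-{(low + 4) / 10:.1f}  f = {f:2d}  relative = {r:.3f}")
```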

EXERCISE: Compute the sample mean and standard deviation


Picturing Distributions: Histogram
• Each bar in the histogram represents a group of values (a bin).
• The height of the bar is the percent of values in the bin.

[Figure: histogram with bins on the horizontal axis and percent on the vertical axis.]
Relative Frequency Histogram of Battery Life

[Figure: relative frequency histogram of battery life.]
How Many Class Intervals?

• Many (narrow class intervals)
  • may yield a very jagged distribution with gaps from empty classes
  • can give a poor indication of how frequency varies across classes
• Few (wide class intervals)
  • may compress variation too much and yield a blocky distribution
  • can obscure important patterns of variation.

[Figure: two histograms of frequency against temperature, one with many narrow classes (width 4) and one with few wide classes (width 30).]
General Guidelines

Number of Data Points    Number of Classes
under 50                 5 - 7
50 – 100                 6 - 10
100 – 250                7 - 12
over 250                 10 - 20

• Class widths can typically be reduced as the number of observations increases
• Distributions with numerous observations are more likely to be smooth and have gaps filled since data are plentiful
• Horizontal vs. vertical bars
Measures of Shape: Skewness

[Figure: three frequency distributions: skewed to left, symmetric, skewed to right.]
Summary
 Basics of descriptive statistics
 Tables and graphs
 Inferential statistics
 Textbook Reading
 Chapter 1 (page 1-28)
 Chapter 8 (page 229-243)

 Motion Charts (link)
