IE101 Reviewer

The document provides an overview of basic concepts in statistics and probability, including descriptive statistics, types of data, measures of central tendency and dispersion, and various probability distributions. It also covers sampling methods, confidence intervals, and hypothesis testing. Key statistical tools and metrics such as mean, variance, standard deviation, and graphical representations like box plots and scatter plots are discussed to facilitate data analysis.

Uploaded by

rain
Basic Concepts of Statistics & Probability

Descriptive statistics - branch of statistics that focuses on describing, summarizing, and organizing information.
Data - term used to describe information that derives from some form of measurement.
Information - term used to refer to interpretations or knowledge based on data analysis. When data is made meaningful, we say we have information.
Quantitative data - counts or measurements for which representation on a numerical scale is naturally meaningful.
Qualitative data - consist of labels, category names, ratings, rankings, and such for which representation on a numerical scale is not naturally meaningful.
Discrete data - quantitative data that are countable using a finite count, such as 0, 1, 2, and so on.
Continuous data - quantitative data that can take on any value within a range of values on a numerical scale in such a way that there are no gaps, jumps, or other interruptions.
Nominal Level - qualitative data that consist of categories
Ordinal Level - qualitative data that consist of categories that can be arranged in some meaningful order according to their relative size or quality, but differences between data values are meaningless
Interval Level - quantitative data where the zero point does not correspond to the absence of the quantity being measured
Ratio Level - quantitative data where the scale has a meaningful zero
Scatter Plot - a graphical tool used to see if a relationship appears between two variables

Measure of Central Tendency


Mean (average): $\bar{x} = \frac{\sum x}{n}$
Mode: $\hat{x}$ = most popular number
Median: $x_m$ = middle data when arranged in ascending order
Measure of Dispersion
Variance: $s^2 = \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Low standard deviation means data are clustered around the mean,
and high standard deviation indicates data are more spread out
Range R = Maximum - Minimum
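The measures above can be checked on a small data set. This is a minimal sketch using Python's standard `statistics` module; the sample values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample data, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)          # sum of x divided by n
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode(data)          # most frequently occurring value
variance = statistics.variance(data)  # sample variance, divisor n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance
data_range = max(data) - min(data)    # maximum minus minimum

print(mean, median, mode, variance, std_dev, data_range)
```

Note that `statistics.variance` and `statistics.stdev` use the n - 1 divisor, matching the formulas above.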
Skewness - statistical metric used to measure the asymmetry of the probability distribution of a random variable about its mean
$\text{Skewness} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3}{(N - 1) \times \sigma^3}$
$X_i$ = ith random variable
$\bar{X}$ = Mean of the distribution
N = number of variables
σ = standard deviation
less than -1 → highly skewed
-1 to -0.5 → moderately skewed
-0.5 to 0.5 → fairly symmetrical
0.5 to 1 → moderately skewed
greater than 1 → highly skewed
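As a rough illustration, the skewness formula and its interpretation ranges can be sketched as follows. The function names `skewness` and `describe_skew` are hypothetical, and σ is computed with the n - 1 divisor used elsewhere in this reviewer.

```python
import math

def skewness(values):
    """Sum of cubed deviations divided by (N - 1) * sigma^3."""
    n = len(values)
    mean = sum(values) / n
    # Standard deviation with the n - 1 divisor (an assumption that
    # follows the variance formula given in this reviewer).
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * sigma ** 3)

def describe_skew(value):
    """Map a skewness value to the interpretation ranges listed above."""
    if value < -1 or value > 1:
        return "highly skewed"
    if value < -0.5 or value > 0.5:
        return "moderately skewed"
    return "fairly symmetrical"

print(describe_skew(skewness([1, 2, 3, 4, 5])))    # symmetric data
print(describe_skew(skewness([1, 1, 1, 1, 10])))   # long right tail
```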

Kurtosis - describes the degree of flatness or peakedness of a sample distribution.
(>3) → leptokurtic
(=3) → mesokurtic
(<3) → platykurtic
Frequency Distribution
1. Calculate the Range of the sample data.
2. Determine the number of classes (k).
→ $2^k > n$ rule: choose the smallest k that satisfies it, approximately log(n)/log(2).
3. Compute the class interval (width or size of each class)
4. Determine the class boundaries. (round up)
5. Construct a tally sheet or frequency table.
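The five steps above can be sketched in code. This is an illustrative sketch only: the function name `frequency_table` is hypothetical, the class width is assumed to be the range divided by k and rounded up, and the largest value is placed in the last class. It also assumes the data are not all equal, so the width is nonzero.

```python
import math

def frequency_table(data):
    """Build (lower, upper, count) classes using the 2^k > n rule."""
    n = len(data)
    low, high = min(data), max(data)       # step 1: range = high - low
    k = 1
    while 2 ** k <= n:                     # step 2: smallest k with 2^k > n
        k += 1
    width = math.ceil((high - low) / k)    # step 3: class width, rounded up
    counts = [0] * k
    for x in data:                         # step 5: tally each observation
        index = min(int((x - low) // width), k - 1)
        counts[index] += 1
    # Step 4: class boundaries start at the minimum and advance by the width.
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(k)]

scores = [12, 15, 17, 21, 22, 25, 28, 30, 31, 35, 38, 40]  # hypothetical data
for lower, upper, count in frequency_table(scores):
    print(lower, upper, count)
```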
Box Plot - type of chart often used in exploratory data analysis to
visually show the distribution of numerical data and skewness through
displaying the data quartiles (or percentiles) and averages.

∘ Minimum Score - the lowest score, excluding outliers (shown
at the end of the left whisker).
∘ Lower Quartile - twenty-five percent of scores fall below the
lower quartile value (also known as the first quartile).
∘ Median - the median marks the mid-point of the data and is
shown by the line that divides the box into two parts
(sometimes known as the second quartile). Half the scores are
greater than or equal to this value and half are less.
∘ Upper Quartile - seventy-five percent of the scores fall below
the upper quartile value (also known as the third quartile).
Thus, 25% of data are above this value.
∘ Maximum Score - the highest score, excluding outliers
(shown at the end of the right whisker).
∘ Whiskers - the upper and lower whiskers represent scores
outside the middle 50% (the lower 25% of scores and the
upper 25% of scores).
∘ Interquartile Range (or IQR) - this is the box in the box plot,
showing the middle 50% of scores (i.e., the range between the 25th and
75th percentile).
∘ Comparison of location: Compare the respective medians, to
compare location.
∘ Comparison of Dispersion: Compare the interquartile ranges
(that is, the box lengths), to compare dispersion. Look at the
overall spread as shown by the adjacent values.
∘ Comparison of Skewness: Look for signs of skewness. If the
data do not appear to be symmetric, does each batch show the
same kind of asymmetry?
∘ Comparison of Outliers: Look for potential outliers.
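The box plot components can be computed directly from the data. The sketch below makes two assumptions that the text above does not state: quartiles come from `statistics.quantiles` with its default method, and outliers are flagged with the common 1.5 × IQR fence rule. The function name `box_plot_summary` is hypothetical.

```python
import statistics

def box_plot_summary(data):
    """Return the values a box plot displays, excluding outliers from whiskers."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4)  # quartile cut points
    iqr = q3 - q1                                        # length of the box
    # Assumed outlier rule: points beyond 1.5 * IQR from the box.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "minimum": inside[0],     # end of the left whisker
        "q1": q1,
        "median": median,
        "q3": q3,
        "maximum": inside[-1],    # end of the right whisker
        "iqr": iqr,
        "outliers": outliers,
    }

print(box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30]))  # hypothetical scores
```

Different quartile conventions give slightly different box edges, so results may not match a calculator or spreadsheet exactly.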
Sample Space, Events, and Probability
Random Experiment - an experiment that can result in different outcomes, even though it is repeated in the same manner every time
Sample Space - set of all possible outcomes of a random experiment
Event - a subset of the sample space of a random experiment

union (∪)
intersection (∩)
complement (′)

Probability - used to quantify the likelihood, or chance, that an outcome of a random experiment will occur
Conditional Probability - P(A|B) = P(A∩B)/P(B)
P(B|A) = P(A∩B)/P(A)
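A small worked example may help with the conditional probability formulas. The sketch below enumerates the sample space of two fair dice; the events A and B are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """Probability of an event (a set of outcomes) with equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

A = {o for o in sample_space if o[0] + o[1] == 8}   # the sum is 8
B = {o for o in sample_space if o[0] % 2 == 0}      # the first die is even

p_a_given_b = probability(A & B) / probability(B)   # P(A|B) = P(A∩B)/P(B)
p_b_given_a = probability(A & B) / probability(A)   # P(B|A) = P(A∩B)/P(A)

print(p_a_given_b, p_b_given_a)
```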
Random Variables
Random variable – a function that assigns a real number to each
outcome in the sample space of a random experiment.
Discrete Probability Distribution
Expected Value or Mean: $\mu = E(X) = \sum x \, P(x)$
Variance: $\sigma^2 = \sum (x - \mu)^2 \, P(x)$
Standard Deviation: $\sigma = \sqrt{\sigma^2}$
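For a discrete random variable, the mean is the probability-weighted sum of the values and the variance is the probability-weighted sum of squared deviations from the mean. The sketch below applies these standard definitions to a fair six-sided die, which is a hypothetical example.

```python
from fractions import Fraction

# Probability distribution of a fair die: each face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                    # E(X)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X)
std_dev = float(variance) ** 0.5                             # square root of variance

print(mean, variance, std_dev)
```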

Continuous Probability Distribution


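Expected Value or Mean: $\mu = E(X) = \int_{-\infty}^{\infty} x \, f(x) \, dx$
Variance: $\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 \, f(x) \, dx$
Standard Deviation: $\sigma = \sqrt{\sigma^2}$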
Discrete Probability Distribution
When to use a Distribution
Discrete Uniform Distribution
→ each outcome has equal probability of occurrence (equally likely outcomes)
Binomial Distribution
→ n repeated trials
→ each trial has two possible outcomes (success/failure)
→ probability of success is same on all trials
→ trials are independent
Negative Binomial Distribution
→ same w/ BD but experiment continues (trials are performed) until a total of k successes
Geometric Distribution
→ same w/ NBD but experiment stops after the 1st success (k=1)
Multinomial Distribution
→ same with binomial but can have multiple outcomes instead of two only
Hypergeometric Distribution
→ consists of finite population N
→ each individual can be characterized as a success/failure, and there are k successes in the population
→ sample of n individuals is selected from N without replacement in such a way that each subset of size n is equally likely to be chosen
Multivariate Hypergeometric Distribution
→ N items can be partitioned into cells
→ random variable has probability of being selected from subtypes
Poisson Distribution
→ describes the number of events occurring in a fixed time interval or region of opportunity
Discrete Uniform Distribution

Parameters:
a = lower limit
b = upper limit
Probability Distribution Function: $P(x) = \frac{1}{b - a + 1}$
Cumulative Distribution Function: $F(x) = \frac{x - a + 1}{b - a + 1}$
Mean E(X): $\frac{a + b}{2}$
Variance: $\frac{(b - a + 1)^2 - 1}{12}$

Binomial Distribution

Parameters:
n = number of attempts
p = probability of success per attempt
q = (1-p) = probability of failure per attempt
x = number of successful attempts among n
Probability Distribution Function: $P(x) = {}_nC_x \times p^x \times q^{n-x}$
Cumulative Distribution Function: $F(x) = \sum_{i=0}^{x} {}_nC_i \times p^i \times q^{n-i}$
Mean E(X): $np$
Variance: $npq$
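The binomial formulas can be evaluated with `math.comb` from the standard library. This is a minimal sketch; the function names `binomial_pmf` and `binomial_cdf` and the coin-flip numbers are hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

def binomial_cdf(x, n, p):
    """F(x) = sum of P(i) for i = 0, 1, ..., x."""
    return sum(binomial_pmf(i, n, p) for i in range(x + 1))

n, p = 10, 0.5                   # e.g. 10 flips of a fair coin
print(binomial_pmf(5, n, p))     # probability of exactly 5 successes
print(binomial_cdf(5, n, p))     # probability of at most 5 successes
print(n * p, n * p * (1 - p))    # mean np and variance npq
```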
Negative Binomial Distribution
Parameters:
k = number of successful attempts needed
p = probability of success per attempt
q = (1-p) = probability of failure per attempt
x = number of attempts; the kth success should occur on attempt x
Probability Distribution Function: $P(x) = {}_{x-1}C_{k-1} \times p^k \times q^{x-k}$
Mean E(X): $\frac{k}{p}$
Variance: $\frac{k(1-p)}{p^2}$

Geometric Distribution
Parameters:
p = probability of success per attempt
q = (1-p) = probability of failure per attempt
x = number of attempts; the 1st success should occur on attempt x
Probability Distribution Function: $P(x) = p \times q^{x-1}$
Mean E(X): $\frac{1}{p}$
Variance: $\frac{1-p}{p^2}$
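The negative binomial and geometric formulas are closely related, since the geometric case is the negative binomial with k = 1. The sketch below shows both; the function names and the numeric inputs are hypothetical.

```python
from math import comb

def negative_binomial_pmf(x, k, p):
    """Probability that the kth success occurs on attempt x."""
    q = 1 - p
    return comb(x - 1, k - 1) * p ** k * q ** (x - k)

def geometric_pmf(x, p):
    """Probability that the 1st success occurs on attempt x."""
    q = 1 - p
    return p * q ** (x - 1)

p = 0.5
print(negative_binomial_pmf(5, 3, p))  # 3rd success on the 5th attempt
print(geometric_pmf(3, p))             # 1st success on the 3rd attempt
print(negative_binomial_pmf(3, 1, p))  # same value as the geometric case
```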

Multinomial Distribution
Probability
Distribution
Function

Parameters
Hypergeometric Distribution
Parameters:
K = number of successes in the population
n = number of sample objects from the population
N = size of the population
x = number of K-typed objects among the sampled n objects
Probability Distribution Function: $P(x) = \frac{\binom{K}{x} \binom{N-K}{n-x}}{\binom{N}{n}}$
Mean E(X): $n \frac{K}{N}$
Variance: $n \, \frac{K}{N} \, \frac{(N-K)}{N} \, \frac{N-n}{N-1}$
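The hypergeometric formulas can be verified numerically with exact fractions. In this sketch the population values N, K, and n are hypothetical, and the function name `hypergeom_pmf` is an assumption for illustration.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, K, n):
    """P(x) = C(K, x) * C(N - K, n - x) / C(N, n)."""
    return Fraction(comb(K, x) * comb(N - K, n - x), comb(N, n))

N, K, n = 20, 5, 4   # population size, successes in population, sample size
pmf = {x: hypergeom_pmf(x, N, K, n) for x in range(0, min(K, n) + 1)}

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(pmf[1])    # probability of exactly one success in the sample
print(mean)      # agrees with n * K / N
print(variance)  # agrees with n * (K/N) * ((N-K)/N) * ((N-n)/(N-1))
```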

Multivariate Hypergeometric Distribution

Poisson Distribution
Continuous Probability Distribution
Continuous Uniform Distribution
Parameters:
a = lower limit
b = upper limit
Probability Distribution Function: $f(x) = \frac{1}{b - a}$
Cumulative Distribution Function: $F(x) = \frac{x - a}{b - a}$
Mean E(X): $\frac{a + b}{2}$
Variance: $\frac{(b - a)^2}{12}$

Standard Normal Distribution - distribution that occurs when a
normal random variable has a mean of zero and a standard deviation of
one.
Sample z-score: $z = \frac{x - \mu}{\sigma}$
Multiple sample z-score: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Probability P(z) = STAT>DISTR>P()
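Outside a calculator, the same z-score and probability lookup can be sketched with `statistics.NormalDist` from the Python standard library. The helper names and the numeric inputs below are hypothetical.

```python
import math
from statistics import NormalDist

def z_score(x, mu, sigma):
    """z-score of a single observation."""
    return (x - mu) / sigma

def z_score_of_mean(x_bar, mu, sigma, n):
    """z-score of a sample mean based on n observations."""
    return (x_bar - mu) / (sigma / math.sqrt(n))

standard_normal = NormalDist(mu=0, sigma=1)

z = z_score(130, 100, 15)           # hypothetical observation, mean, sigma
p_below = standard_normal.cdf(z)    # P(Z <= z)

print(z, p_below)
print(z_score_of_mean(106, 100, 15, 25))
```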

Exponential Distribution - used to model the time elapsed between
events
Parameter: β = expected value of X
Probability Function: $f(x) = \frac{1}{\beta} e^{-x/\beta}$
Cumulative Distribution Function: $P(X \le x) = F(x) = 1 - e^{-x/\beta}$

If ≤ or < is reversed to ≥ or >, subtract the answer from 1: $P(X \ge x) = 1 - F(x) = e^{-x/\beta}$
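The exponential formulas translate directly into code. This is a minimal sketch; the function names and the value β = 2 are hypothetical.

```python
import math

def exp_pdf(x, beta):
    """f(x) = (1 / beta) * e^(-x / beta)."""
    return math.exp(-x / beta) / beta

def exp_cdf(x, beta):
    """P(X <= x) = 1 - e^(-x / beta)."""
    return 1 - math.exp(-x / beta)

def exp_survival(x, beta):
    """P(X >= x), obtained by subtracting the CDF from 1."""
    return 1 - exp_cdf(x, beta)

beta = 2.0   # hypothetical mean time between events
print(exp_cdf(2.0, beta))       # chance the wait is at most 2 time units
print(exp_survival(2.0, beta))  # chance the wait is at least 2 time units
```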


Joint Probability Distribution

$P(X, Y) = \frac{\binom{A}{X} \binom{B}{Y} \binom{C}{2 - X - Y}}{\binom{A + B + C}{2}}$
Sampling Distribution
Population - group that includes all the cases (individuals, objects, or groups) in which the researcher is interested. Size=N.
Sample - a relatively small subset from a population. Size=n.
Random sampling - each individual is chosen entirely by chance and each member of the population has a known, but possibly non-equal, chance of being included in the sample.
Simple Random Sampling - process of selecting a sample that allows each individual in the defined population to have an equal and independent chance of being selected for the sample. It can be done using a computer, calculator, or a table of random numbers.
Systematic Random Sampling - every Kth member (K is a ratio obtained by dividing the population size by the desired sample size) in the total population is chosen for inclusion in the sample after the first member of the sample is selected at random from among the first K members of the population.
Stratified Random Sampling - there may often be factors which divide up the population into subpopulations (groups / strata) and we may expect the measurement of interest to vary among the different sub-populations. A random sample is then drawn from each stratum.
Cluster Sampling - entire population is divided into groups, or clusters, and a random sample of these clusters is selected. All observations in the selected clusters are included in the sample.
Parameters - characteristics of a population or, more specifically, a target population. Parameters may also be termed population values.
Central Limit Theorem
point estimate - specific numerical value estimate of a parameter
bias - difference between the expected value of the estimator and the value of the parameter being estimated.
standard error - standard error of an estimator is its approximated standard deviation
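Two of the sampling methods described in this section can be sketched with the standard `random` module. The function names are hypothetical, and the systematic version assumes the population size is at least the desired sample size.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member has an equal chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Pick a random start among the first K members, then every Kth member."""
    rng = random.Random(seed)
    k = len(population) // n     # K = population size / desired sample size
    start = rng.randrange(k)     # random position among the first K members
    return [population[start + i * k] for i in range(n)]

population = list(range(1, 101))   # hypothetical population of 100 IDs
print(simple_random_sample(population, 10, seed=1))
print(systematic_sample(population, 10, seed=1))
```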
Sampling Distribution
Confidence Level - probability that the value of a parameter falls
within a specified range of values. It is the percentage equivalent of the
decimal value of 1–α.
Confidence Intervals for the Mean (σ Known or n≥30)
Confidence Level (critical value $z_{\alpha/2}$)
90% = 1.645
95% = 1.96
99% = 2.58
Maximum error of estimate - maximum difference between the point estimate of a parameter and the actual value of the parameter
Minimum size needed for an Interval Estimate
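The formulas for this case do not appear in the text here, so the sketch below uses the standard textbook forms as an assumption: interval $\bar{x} \pm z\,\sigma/\sqrt{n}$, maximum error $E = z\,\sigma/\sqrt{n}$, and minimum sample size $n = (z\,\sigma/E)^2$ rounded up. The function names and inputs are hypothetical.

```python
import math

# Critical values listed above for each confidence level.
Z_VALUES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.58}

def z_interval(sample_mean, sigma, n, level=0.95):
    """Confidence interval for the mean when sigma is known (assumed formula)."""
    z = Z_VALUES[level]
    margin = z * sigma / math.sqrt(n)   # maximum error of estimate
    return sample_mean - margin, sample_mean + margin

def minimum_sample_size(sigma, max_error, level=0.95):
    """Smallest n whose maximum error does not exceed max_error (assumed formula)."""
    z = Z_VALUES[level]
    return math.ceil((z * sigma / max_error) ** 2)

print(z_interval(50, 10, 100))      # hypothetical mean, sigma, and n
print(minimum_sample_size(10, 2))   # hypothetical sigma and target error
```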

Confidence Intervals for the Mean (σ Unknown or n<30)

Confidence Intervals for Proportions

Minimum size needed for an Interval Estimate
Confidence Intervals for Variances and Standard Deviations

Variance

Standard Deviation

Prediction Intervals

σ2 Known

σ2 Unknown

Tolerance Intervals
𝑥̅ ± 𝑘𝑠
Hypothesis Testing
hypothesis test - process that uses sample statistics to test a claim about the value of a population parameter
statistical hypothesis - verbal statement, or claim, about a population parameter
null hypothesis - states that there is no difference between groups or no relationship between variables.
alternative hypothesis - states what you expect the data to show, based on your research on the topic.
type I error - rejection of the null hypothesis when it is true
type II error - non-rejection of the null hypothesis when it is false

Steps in Hypothesis Testing


1. Specify the Hypotheses
level of significance (α) is the maximum probability of committing a type I error.
2. Find the critical value/s
critical value(s) – separates the critical region from the noncritical region
critical or rejection region – the range of values of the test value that indicates that there is a significant difference and that the null hypothesis should be rejected
noncritical or non-rejection region – the range of values of the test value that indicates that the difference was probably due to chance and that the null hypothesis should not be rejected
3. Calculate the Test Statistic (test value)
test statistic – statistic that is compared with the parameter in the null hypothesis
4. Find the p-value
P-value – probability of obtaining a sample statistic with a value as extreme or more extreme than the one determined from the sample data
5. Make decision to reject or not the null hypothesis
Decision rule based on P-value
If P ≤ α, then reject H0
If P > α, then fail to reject H0
Decision rule based on test value (as written, for a left-tailed test; in general, reject H0 when the test value falls in the critical region)
If test value ≤ critical value, then reject H0
If test value > critical value, then fail to reject H0
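The P-value decision rule can be illustrated with a one-sample z-test. This is a sketch under stated assumptions: σ is known, the test statistic takes the usual form $z = (\bar{x} - \mu_0)/(\sigma/\sqrt{n})$, and the function name `z_test` with its `tail` argument is hypothetical.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """Return the test value, the P-value, and the decision about H0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # test statistic
    cdf = NormalDist().cdf(z)
    if tail == "left":
        p_value = cdf
    elif tail == "right":
        p_value = 1 - cdf
    else:                                              # two-tailed test
        p_value = 2 * min(cdf, 1 - cdf)
    # Decision rule based on the P-value.
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return z, p_value, decision

print(z_test(104, 100, 15, 36))                # hypothetical two-tailed test
print(z_test(106, 100, 15, 36, tail="right"))  # hypothetical right-tailed test
```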

Statistics Tests
t-test for sample mean (σ Unknown or n<30)

z-test for sample mean (σ Known or n≥30)

z-test for proportions (np and nq ≥ 5)

χ² test for variance & standard deviation
Two Sample Hypothesis Testing

Two sample z-test
- randomly selected
- independent
- at least 30 or known standard deviation

Two sample t-test
- less than 30
- randomly selected
- independent
- normal distribution

Two sample t-test (population variances are not equal)

Two sample t-test (population variances are equal)

Two sample z-test for proportions (np and nq ≥ 5)
Correlation and Regression
Scatter plot – a graph of the ordered pairs (x,y) of numbers consisting
of the independent variable, x, and the dependent variable, y.
Regression
One-way Analysis of Variance (ANOVA)
