Statistics Introduction
Statistics is used in all kinds of science and business applications.
Statistics gives us more accurate knowledge which helps us make better decisions.
Statistics can focus on making predictions about what will happen in the future. It can also focus on
explaining how different things are connected.
The typical steps of statistical analysis are:
1. Gathering data
2. Describing and visualizing the data
3. Making conclusions
Statistics can be used to explain things in a precise way. You can use it to understand and make
conclusions about the group that you want to know more about. This group is called the population.
Gathering data about the population will give you a sample. This is a part of the whole population.
Statistical methods are then used on that sample.
The results of the statistical methods from the sample are used to make conclusions about the
population.
Note: The word 'statistic' can also refer to a specific piece of knowledge, like the average value of
something.
Note: Data from a proper sample is often just as good as data from the whole population, as long as it
is representative! A good sample allows you to make accurate conclusions about the whole
population.
Descriptive Statistics
Descriptive statistics summarizes and describes the main features of a data set, such as its center
and its variation. It is also useful for guiding further analysis, giving insight into the data, and
finding what is worth investigating more closely.
Statistical Inference
Statistical inference uses statistics calculated from a sample to make conclusions about the
population. Probability theory is used to calculate the certainty that those statistics also apply to the population.
Confidence intervals are numerical ways of showing how likely it is that the true value of a
population parameter is within a certain range.
Hypothesis testing is another way of checking if a statement about a population is true. More
precisely, it checks how likely it is that a hypothesis is true, based on the sample data.
Some examples of statements or questions that can be checked with hypothesis testing:
* Is the average height of people in Denmark more than 170 cm?
* Do people prefer one brand over another?
Note: Confidence intervals and hypothesis testing are closely related and describe the same things in
different ways. Both are widely used in science.
Causal Inference
Causal inference is used to investigate whether one thing causes another.
Note: Good experimental design is often difficult to achieve because of ethical concerns or other
practical reasons.
Prediction
Predictions about future events are called forecasts. Not all predictions are about the future.
Some predictions can be about something else that is unknown, even if it is not in the future.
Explanation
Making conclusions about causality should be done carefully.
Population and Samples
Population: Everything in the group that we want to learn about.
Sample: A part of the population that we gather data about.
Note: Every other sampling method is compared to how close it is to a random sample - the closer,
the better.
Convenience Sampling
A convenience sample is where the participants that are the easiest to reach are chosen.
Systematic Sampling
A systematic sample is where the participants are chosen by some regular system.
For example: choosing every 10th person on a list, or the first 30 people in a queue.
Clustered Sampling
A clustered sample is where the population is first split into smaller groups called 'clusters'. Then,
some of the clusters are randomly selected and data is gathered from them.
Examples:
* Brands
* Nationality
Quantitative Data
Information about something that is described by numbers. With numerical data we can calculate
statistics like the average.
Examples:
* Income
* Age
Measurement Levels
Nominal Level
Categories (qualitative data) without any order.
Examples:
* Brand names
* Countries
Ordinal Level
Categories that can be ordered (from low to high), but the precise "distance" between each is not
meaningful.
Examples:
* Military ranks
Interval Level
Data that can be ordered and the distance between them is objectively meaningful. But there is no
natural 0-value where the scale originates.
Examples:
* Years in a calendar
Ratio Level
Data that can be ordered and there is a consistent and meaningful distance between them. And it
also has a natural 0-value.
Examples:
* Money
* Age
Different kinds of averages, like mean, median and mode, are measures of the center.
Statistics like standard deviation, range and quartiles are measures of variation.
Frequency Tables
One typical way of presenting data is with frequency tables.
A frequency table counts and orders data into a table. Typically, the data will need to be
sorted into intervals.
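For example, a simple frequency table can be built with Python's collections.Counter (the intervals below are a hypothetical sample):

```python
from collections import Counter

# Hypothetical data: ages already sorted into intervals
ages = ["20-29", "30-39", "20-29", "40-49", "30-39", "20-29"]

# Count how many times each interval occurs
frequency_table = Counter(ages)

for interval, count in sorted(frequency_table.items()):
    print(interval, count)
```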
Visualizing Data
Different types of graphs are used for different kinds of data. For example:
* Pie charts for qualitative data
* Histograms for quantitative data
Average
The Center of the Data
The center of the data is where most of the values in the data are located.
There are different types of averages. The most commonly used are:
* Mean
* Median
* Mode
Mean
The mean is the sum of all the values in the data divided by the total number of values in
the data.
Calculating the population mean ($\mu$) and the sample mean ($\bar{x}$) is done with this
formula:

$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$

where $n$ is the number of values. The population mean $\mu$ is calculated the same way, using all
$N$ values in the population.
Calculation with Programming
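For example, the mean can be calculated with Python and the numpy library. This is a minimal sketch; the values are a hypothetical sample:

```python
import numpy as np

# Hypothetical sample of values
values = [13, 21, 21, 40, 48, 55, 72]

# The mean: the sum of the values divided by the number of values
mean = np.mean(values)

print(mean)  # 38.57...
```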
Median
The median is the middle value in a data set ordered from low to high. The
median is a type of average value, which describes where the center of the data
is located.
Finding the Median with Programming
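A minimal sketch with Python and numpy, using the same hypothetical sample:

```python
import numpy as np

# Hypothetical sample (np.median sorts the values internally)
values = [13, 21, 21, 40, 48, 55, 72]

# The median: the middle value of the ordered data
median = np.median(values)

print(median)  # 40.0
```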
Mode
The mode is the value(s) that appears most often in the data.
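For example, the mode can be found with Python's built-in statistics module. A minimal sketch with hypothetical values; multimode returns all values that appear most often:

```python
from statistics import multimode

# Hypothetical sample: 21 appears twice, every other value once
values = [13, 21, 21, 40, 48, 55, 72]

print(multimode(values))  # [21]
```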
There are different measures of variation. The most commonly used are:
• Range
• Quartiles and Percentiles
• Interquartile Range
• Standard Deviation
Range
The range is the difference between the smallest and the largest value of the
data.
Calculating the Range with Programming
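For example, with Python and numpy (hypothetical sample values):

```python
import numpy as np

# Hypothetical sample of values
values = [13, 21, 21, 40, 48, 55, 72]

# The range: the largest value minus the smallest value
data_range = np.max(values) - np.min(values)

print(data_range)  # 72 - 13 = 59
```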
Quartiles
Quartiles are values that separate the data into four equal parts.
Percentiles
Percentiles are values that separate the data into 100 equal parts.
The 25th percentile (P25%) is the same as the first quartile (Q1).
The 50th percentile (P50%) is the same as the second quartile (Q2) and the
median.
The 75th percentile (P75%) is the same as the third quartile (Q3)
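For example, quartiles can be calculated as percentiles with Python and numpy. A minimal sketch with hypothetical values; note that numpy interpolates between data points by default:

```python
import numpy as np

# Hypothetical sample of values
values = [13, 21, 21, 40, 42, 48, 55, 72]

# Quartiles are the 25th, 50th and 75th percentiles
q1 = np.percentile(values, 25)
q2 = np.percentile(values, 50)  # the same as the median
q3 = np.percentile(values, 75)

print(q1, q2, q3)
```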
Interquartile Range
Interquartile range is the difference between the first and
third quartiles (Q1 and Q3).
Calculating the Interquartile Range with Programming
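A minimal sketch using Python with the scipy library (hypothetical sample values):

```python
from scipy import stats

# Hypothetical sample of values
values = [13, 21, 21, 40, 42, 48, 55, 72]

# The interquartile range: Q3 - Q1
print(stats.iqr(values))
```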
Standard Deviation
Standard deviation (σ) measures how far a 'typical' observation is from the
average of the data (μ).
If the data is normally distributed:
• About 68% of the values are within 1 standard deviation of the mean.
• About 95% of the values are within 2 standard deviations of the mean.
• About 99.7% of the values are within 3 standard deviations of the mean.
Note: A normal distribution has a "bell" shape and spreads out equally on
both sides.
Calculating the Standard Deviation with Programming
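For example, with Python and numpy (hypothetical sample values). Note that np.std calculates the population standard deviation by default; for the sample standard deviation, set ddof=1:

```python
import numpy as np

# Hypothetical sample of values
values = [13, 21, 21, 40, 48, 55, 72]

# Population standard deviation (divides by n)
print(np.std(values))

# Sample standard deviation (divides by n - 1)
print(np.std(values, ddof=1))
```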
Estimation
Statistics from a sample are used to estimate population parameters.
Hypothesis Testing
Hypothesis testing is a method to check if a claim about a population is
true. More precisely, it checks how likely it is that a hypothesis is true, based on the sample data.
The steps of the test depend on the type of parameter being tested and the data that is available.
Normal Distribution
The normal distribution is described by the mean (μ) and the standard
deviation (σ).
The area under the curve of the normal distribution represents probabilities
for the data.
The standard deviation describes how spread out the normal distribution is.
Probability Distributions
Probability distributions are functions that calculate the probabilities of the
outcomes of random variables.
Typical examples of random variables are coin tosses and dice rolls.
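For example, the probability distribution of a fair six-sided die can be written out directly in Python (a minimal sketch):

```python
from fractions import Fraction

# A fair die: each of the six outcomes has probability 1/6
distribution = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

# The probabilities of all possible outcomes sum to 1
print(sum(distribution.values()))  # 1
```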
Z-Values
Z-values (z-scores) express how many standard deviations from the mean a value is. The formula is:

$Z = \frac{x - \mu}{\sigma}$

For example, if Bob is 200 cm tall, and the mean height in Germany is 170 cm with a standard
deviation of 10 cm, Bob's z-value is:

$Z = \frac{x - \mu}{\sigma} = \frac{200 - 170}{10} = \frac{30}{10} = 3$

Using a z-table or programming, we can find that the probability of a value below this z-value is
≈ 0.9987, or 99.87%. This means that Bob is taller than 99.87% of the people in Germany.
Finding P-value
To find the p-value above the z-value we can calculate 1 minus the
probability.
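Both probabilities can be calculated with Python and the scipy library, using the numbers from the example above:

```python
from scipy.stats import norm

# The example values: Bob's height, the mean and the standard deviation
x, mu, sigma = 200, 170, 10

z = (x - mu) / sigma  # 3.0

# Probability of a value below the z-value (left tail area)
print(norm.cdf(z))      # ~0.9987

# P-value above the z-value: 1 minus the probability below
print(1 - norm.cdf(z))  # ~0.0013
```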
T Distribution
The t-distribution is used for estimation and hypothesis testing of a
population mean (average).
If the sample is small, the t-distribution is wider. If the sample is big, the t-
distribution is narrower.
The bigger the sample size is, the closer the t-distribution gets to the
standard normal distribution.
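For example, with Python and scipy we can see the critical t-values approach the critical z-value of the standard normal distribution as the degrees of freedom (sample size minus 1) grow:

```python
from scipy.stats import norm, t

# Critical values for a 95% two-sided confidence level
for df in [5, 30, 1000]:
    print(df, t.ppf(0.975, df))

print("normal:", norm.ppf(0.975))  # ~1.96
```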
Estimation
A point estimate is the most likely single value for a population parameter. For example:
The point estimate for the average height of people in Denmark is 180 cm.
A confidence interval instead expresses the estimate as a range of values. For example:
The average height of people in Denmark is between 170 cm and 190 cm.
Here, 170 cm is the lower bound, and 190 cm is the upper bound.
The interval is calculated for a chosen confidence level. The most commonly used confidence
levels are:
• 90% (0.90)
• 95% (0.95)
• 99% (0.99)
The higher the confidence level, the bigger the interval will be.
For example, a 99% confidence interval for the average height of people in Denmark will be wider
than a 95% confidence interval.
The margin of error is based on the confidence level and the data we have
from the sample.
For example, if the point estimate for the average height of people in Denmark is 180 cm and the
margin of error is 10 cm, the confidence interval is [170 cm, 190 cm].
Calculating a confidence interval requires that certain conditions are met. One condition is that
the sample is randomly selected from the population. The other conditions depend on what type of
parameter you are calculating the confidence interval for.
In our example, we randomly selected 30 Nobel Prize winners and found that 6 of them were born
in the US. The rest, 24 winners, were not born in the US. The point estimate for the proportion of
US-born winners is:

$\hat{p} = \frac{x}{n} = \frac{6}{30} = 0.2 = 20\%$
Note: A 95% confidence level means that if we take 100 different samples
and make confidence intervals for each:
The true parameter will be inside the confidence interval 95 out of those 100
times.
We use the standard normal distribution to find the margin of error for the
confidence interval.
The remaining probability (α) is divided in two so that half is in each tail area of the distribution.
The values on the z-value axis that separate the tail areas from the middle are called critical
z-values.
The critical z-value $z_{\alpha/2}$ is calculated from the standard normal distribution and the
confidence level.
The standard error $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ is calculated from the point estimate ($\hat{p}$)
and the sample size ($n$).
In our example with 6 US-born Nobel Prize winners out of a sample of 30, the standard error is:

$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.2(1-0.2)}{30}} = \sqrt{\frac{0.2 \cdot 0.8}{30}} = \sqrt{\frac{0.16}{30}} = \sqrt{0.00533} \approx 0.073$
In our example the point estimate was 0.2 and the margin of error was 0.143, then:

$\hat{p} - E = 0.2 - 0.143 = 0.057$
$\hat{p} + E = 0.2 + 0.143 = 0.343$

The confidence interval is $[0.057, 0.343]$, or $[5.7\%, 34.3\%]$.
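The whole calculation can be done with Python and scipy. This sketch follows the numbers from the example above:

```python
import math
from scipy.stats import norm

# 6 US-born Nobel Prize winners out of a sample of 30
x, n = 6, 30
p_hat = x / n                      # point estimate: 0.2

confidence = 0.95
alpha = 1 - confidence

# Critical z-value for the chosen confidence level
z_crit = norm.ppf(1 - alpha / 2)   # ~1.96

# Standard error and margin of error
se = math.sqrt(p_hat * (1 - p_hat) / n)  # ~0.073
e = z_crit * se                          # ~0.143

print(p_hat - e, p_hat + e)  # ~0.057, ~0.343
```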
Hypothesis Testing
A hypothesis is a claim about a population parameter.
The null hypothesis (H0) and the alternative hypothesis (H1) are the
claims.
The two claims need to be mutually exclusive, meaning only one of them
can be true.
For example, say we want to check the claim that the average height of people in Denmark is more
than 170 cm. In this case, the parameter is the average height of people in Denmark (μ).
H0: μ=170cm
H1: μ>170cm
If the data supports the alternative hypothesis, we reject the null hypothesis
and accept the alternative hypothesis.
If the data does not support the alternative hypothesis, we keep the null
hypothesis.
The most commonly used significance levels (α) are:
• α=0.1 (10%)
• α=0.05 (5%)
• α=0.01 (1%)
A lower significance level means that the evidence in the data needs to be
stronger to reject the null hypothesis.
The size of the rejection region is decided by the significance level (α).
The value that separates the rejection region from the rest is called
the critical value.
If the test statistic is inside this rejection region, the null hypothesis
is rejected.
The p-value of the test statistic is the tail-area probability of the distribution beyond the value of
the test statistic.
If the p-value is smaller than the significance level, the null hypothesis
is rejected.
The p-value directly tells us the lowest significance level where we can
reject the null hypothesis.
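For example, a right-tailed test of the average height claim can be done with Python and scipy. The heights below are a hypothetical sample, used only for illustration:

```python
from scipy.stats import ttest_1samp

# Hypothetical sample of heights (cm) from Denmark
heights = [172, 171, 175, 169, 174, 178, 170, 173, 176, 172]

# H0: mu = 170, H1: mu > 170 (right-tailed test)
result = ttest_1samp(heights, popmean=170, alternative="greater")

print(result.statistic, result.pvalue)

# Reject H0 if the p-value is smaller than the significance level
print(result.pvalue < 0.05)
```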
Chi Square
A chi-squared test (symbolically represented as χ2) is a statistical test used to compare observed
results with expected results. Usually, it is a comparison of two sets of categorical data. The
purpose of this test is to determine if a difference between observed data and expected data is due
to chance, or if it is due to a relationship between the variables you are studying. In other words,
it tests how likely it is that the variables are independent.
Formula

$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$

where $O_i$ is the observed value and $E_i$ is the expected value.
Note: The chi-squared test is applicable only for categorical data, such as men and women
falling under the category of Gender.
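For example, with Python and scipy's chisquare function (the observed and expected counts below are hypothetical):

```python
from scipy.stats import chisquare

# Hypothetical counts for four categories; both sum to 100
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

# Chi-squared test: compares observed results with expected results
stat, p = chisquare(f_obs=observed, f_exp=expected)

print(stat, p)
```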
ANOVA Test
• Analysis of variance, or ANOVA, is a statistical method that
separates observed variance data into different components to use
for additional tests.
• A one-way ANOVA is used for three or more groups of data, to gain
information about the relationship between the dependent and
independent variables.
• If no true variance exists between the groups, the ANOVA's F-ratio
should be close to 1.
The F-ratio is calculated as:

$F = \frac{MST}{MSE}$

where MST is the mean sum of squares due to treatment, MSE is the mean sum of squares due to
error, and F is the ANOVA coefficient.
Types of ANOVA
There are two main types of ANOVA:
• One-way.
• Two-way.
The one-way ANOVA has one independent variable. It is used to determine whether there are
statistically significant differences between the means of three or more independent groups.
With a two-way ANOVA, there are two independent variables. It makes it possible to observe the
interaction between the two factors and tests the effect of two factors at the same time. For
example, you could run an advertisement at 10 different stores for one month and measure total
sales, while also varying the type of advertisement, to study the effect of both factors.
Note: Multivariate analysis of variance (MANOVA) differs from ANOVA as the former tests for
multiple dependent variables.
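For example, a one-way ANOVA can be done with Python and scipy's f_oneway function (the group data below is hypothetical):

```python
from scipy.stats import f_oneway

# Hypothetical measurements from three independent groups
group_a = [20, 23, 21, 25, 22]
group_b = [28, 27, 30, 26, 29]
group_c = [22, 24, 23, 25, 21]

# One-way ANOVA: tests whether the group means differ
f_stat, p_value = f_oneway(group_a, group_b, group_c)

print(f_stat, p_value)
```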