FIN10002 - Notes Master


FUNCTIONS:

Excel:
- FREQUENCY()
- COUNTIF()
- VSTACK()
- AVERAGE()
- MEDIAN()
- MODE()
- PERCENTILE()
- QUARTILE()
- MAX()
- MIN()
- STDEV()
- VAR()
- SKEW()
- STANDARDIZE()
- NORM.S.DIST()
- NORM.S.INV()
- NORM.DIST()
- NORM.INV()
- CONFIDENCE.NORM()
- CONFIDENCE.T()

Python:
- pandas.value_counts()
- matplotlib.pyplot.hist()
- matplotlib.pyplot.show()
- matplotlib.pyplot.savefig()
- pandas/numpy.mean()
- pandas/numpy.median()
- pandas.mode() (NumPy has no mode function)
- pandas.quantile()
- numpy.quantile()
- numpy.percentile()
- pandas.std()
- pandas.var()
- pandas.describe()
- numpy.ptp()
- scipy.stats.rv_discrete.mean()
- scipy.stats.rv_discrete.var()
- scipy.stats.rv_discrete.std()
- scipy.stats.norm.cdf()
- scipy.stats.norm.sf()
- scipy.stats.norm.ppf()
- scipy.stats.norm.isf()
- scipy.stats.norm.interval()
- scipy.stats.t.interval()
- math.sqrt()

Equations:
LECTURE NOTES WEEK 1
Introduction To Statistics
The success or failure of a business hinges largely on the decisions made by the people within
it. Information is integral to decision making, and information is often the outcome of data
collection, analysis, interpretation and reporting.

Informing Business Strategy: Data VS Information

Data is part of the equation of information, but it is not information in and of itself. TO BE USEFUL,
data must be gathered, processed, stored, manipulated, and tested, using various statistical
methods.

1.1 What is Statistics?


What is Financial Statistics?

Financial Statistics – a collection of procedures and techniques used to convert data into
meaningful information in a financial (business) environment.
Statistics – is a mathematical science concerned with data collection, analysis,
interpretation, explanation and presentation.

Statistical Procedures

Can be split into two categories.

Descriptive – procedures and techniques designed to describe data.
▪ Charts (graphs) & Tables
▪ Numerical measures (e.g. averages (mean),
mode, median etc.)

Inferential – tools and techniques that help decision makers to draw inferences from a set of
data.
▪ Estimation (e.g. estimating the population mean
weight using the sample mean weight)
▪ Hypothesis testing (e.g. using sample evidence
to test the claim that the population mean weight
is 75kg)
Procedures for Collecting Data

There are numerous ways data can be collected. Some include:


- experiments,
- telephone surveys,
- written questionnaires,
- surveys,
- direct observation,
- personal interviews.

Population VS Sample

Population – the set of objects of interest (e.g. ‘all Bunnings employees’, ‘all business students’, ‘all cars in carparks’).
Census – the process of making measurements on the
whole population for variables of interest.

Sample – a subset of the population.


Sampling – the process of choosing a sample according
to valid statistical principles.

Parameters & Statistics

Parameters – descriptive numerical measures, such as an average or a proportion, that are
computed from an entire population (generally represented by Greek letters).

Statistics – corresponding measures computed for a sample.

The Process of Inferential Statistics


Data Collection

Experiment – a process that produces a single outcome whose result cannot be predicted with
certainty.
Experimental Design – a plan for performing an experiment in which the variable of
interest is defined.

Written Questionnaire/Survey – formed by a combination of closed-end questions and open-end questions.
Closed-end questions – questions which require the respondent to select from a list of
defined choices.
Open-end questions – questions that allow respondents the freedom to respond with
any value, words, or statements of their own choosing.

Direct Observations – data being collected is physically observed and recorded based on what
takes place in the process.
Subjective and time consuming.

Personal Interviews – can be structured (questions are scripted) or unstructured (begin with
one or more broadly stated questions, with further questions being based on responses).

Types of Data & Measurement Levels

Data Timing
Cross-sectional Data – data that is
collected at a fixed point in time.
E.g.: Businesses often conduct surveys
to gauge consumer sentiment about a
new product. This captures data at a
single point in time.

Time-series Data – data which is collected over time.


E.g.: Businesses collect and track their sales data on a daily, weekly, monthly, quarterly
or yearly basis, allowing them to see patterns over time.
Data Types

Quantitative – measurements whose values are inherently numerical.


Discrete – whole numbers (e.g. number of children).
Continuous – fractional/decimal numbers (e.g. weight, volume).

Qualitative – data whose measurement scale is inherently categorical (e.g. marital status,
political affiliation, eye colour).

Data Measurement Levels

Nominal Data – is the lowest form of data, data which can be labelled/classified into mutually exclusive categories.
Examples: student ID number, favourite colour, gender,
marital status, category codes, category names.

Ordinal Data – where data elements can be ordered on the basis of some relationship among them.
Examples: rankings (e.g. class standing (undergrad, grad,
etc.)), ordered categories (e.g. age groups), satisfaction level.

Interval Data – if the distance between two data items can be measured on some scale, and the data has ordinal properties, the
data is said to be interval data. A zero value does not imply that there is nothing.
Example: TEMPERATURE. Temperature in degrees Celsius or Fahrenheit is interval scale
because zero is an arbitrary point on the scale, not a true zero value.

Ratio Data – data that has all the characteristics of interval data
but has a true zero value/point.
Examples: weight, time, pay rate per hour, interest rates.
Week 2
Frequency Distributions & Histograms

Frequency Distribution – a summary of data presented in the form of class intervals and their
corresponding frequency.
A set of data that displays the number of observations in each of the distribution’s
distinct categories/classes. It is a list or table which contains the values of a variable (or
a set of ranges within which the data falls) and the frequencies with which each
value occurs.
Un-grouped Data – (raw data), is data which has not been summarised in any
way.
Grouped Data – is data which has been organised into a frequency distribution.

Discrete Data – data that can take on a countable number of possible values (e.g. student ages,
amazon product categories, etc.).

Continuous Data – data whose possible values are uncountable and that may assume any
value in an interval (weight, length, time, etc.).

Relative Frequency

Relative Frequency – the proportion of total observations that are in a given category.
$\text{Relative Frequency} = \frac{f_i}{n}$
In simple terms: (frequency of the value) ÷ (total number of observations).
$f_i$ = frequency of the ith value of the discrete variable.
$n = \sum_{i=1}^{k} f_i$ (the number of observations).
$k$ = the number of different values for the discrete variable.

How to Do it: Excel

Frequency Function
Syntax:
=FREQUENCY([range/array of data],[bins/class array])

Data Analysis Tool


Does everything for you. Select data analysis, select ‘histogram’, input the values, select an
output destination, tweak as you desire, then press ok.

CountIf Function
Syntax:
=COUNTIF([range/array of data],[criteria])

Pivot Table Approach


First make it so all data is in a single column (use vertical stack function [=VSTACK()]). Select a
cell in the data column, then insert pivot table. Data is automatically highlighted. Click ok, then
select the data to be the rows, and then value by count.
Frequency Tables (Grouped Data)
Criteria for Building Classes

There are certain criteria which must be followed when building classes/bins:
- Classes must be mutually exclusive (not overlapping).
- Classes must be all-inclusive (a set of classes should include all possible data values).
- Classes should be of equal width, if possible (the distance between the lowest and
highest values of each class should be equal).
- Empty classes should be avoided.

How many Classes?

Many (narrow class intervals):


- May yield very jagged distributions and have empty classes.
- Can give a poor indication of how frequency varies across classes.

Few (wide class intervals):


- May compress variation too much.
- Can obscure variation.

How to Calculate the Number of Classes

Rule of thumb: between 5-20 classes.

$2^k \ge n$ RULE
Where k is the number of classes, defined as the smallest integer such that $2^k \ge n$, where n
is the number of data values.

To calculate the width of a class:

$\text{Class Width} = \frac{\text{Maximum Data Value} - \text{Minimum Data Value}}{\text{Number of Classes}}$
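
The two calculations above can be sketched in Python. This is only an illustration: the data values and the helper names number_of_classes and class_width are made up for the example.

```python
def number_of_classes(n: int) -> int:
    """Return the smallest integer k such that 2**k >= n."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k


def class_width(values, k: int) -> float:
    """Class width = (maximum - minimum) / number of classes."""
    return (max(values) - min(values)) / k


# Hypothetical data set of 12 observations
data = [12, 15, 21, 22, 27, 30, 34, 38, 41, 45, 47, 52]

k = number_of_classes(len(data))   # 2**3 = 8 < 12, 2**4 = 16 >= 12, so k = 4
width = class_width(data, k)       # (52 - 12) / 4 = 10.0

print("classes:", k)
print("class width:", width)
```

In practice the width is usually rounded up to a convenient number so that the classes cover every data value.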
Cumulative Frequency

Cumulative Frequency Distribution – a summary of a set of data that displays the number of
observations with values less than or equal to the upper limit of each class.

Cumulative Relative Frequency Distribution – a summary of a set of data that displays the
proportion of observations with values less than or equal to the upper limit of each class.

The frequency function, pivot table and data analysis tool all work with cumulative frequencies as well.
Python Frequency Analysis

Need to import the pandas library to read the document.

To count the number of appearances of each value, use the function:

[Name].value_counts()

Grouping data is very simple in Python. You can just specify how many bins you desire with the
bins argument. It is important, though, that you name the column which is to be binned (in this
example: data['Data'].value_counts(bins=XX)).
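
A minimal sketch of these calls, assuming a column named 'Data' as in the notes. The values are invented here; in practice the DataFrame would come from reading a file with pandas.

```python
import pandas as pd

# Hypothetical data; normally this would come from pd.read_excel(...) or pd.read_csv(...)
data = pd.DataFrame({"Data": [1, 2, 2, 3, 3, 3, 7, 8, 9, 9]})

# Frequency of each distinct value (un-grouped data)
counts = data["Data"].value_counts()

# Relative frequency: each frequency divided by the number of observations
relative = data["Data"].value_counts(normalize=True)

# Grouped data: 3 equal-width bins, kept in bin order rather than sorted by count
grouped = data["Data"].value_counts(bins=3, sort=False)

# Cumulative frequency across the bins
cumulative = grouped.cumsum()

print(counts)
print(relative)
print(grouped)
print(cumulative)
```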

Histograms
Histogram – a graph of a frequency distribution with the horizontal axis showing the classes and
the vertical axis showing the frequency counts (a visual representation of a frequency
distribution). Important: no gaps between the bars in the graph.

Easiest way to do it in Excel:


Use the data analysis tool. You can create a histogram when creating the frequency distribution.
Must adjust the gap width to 0.
Python Histograms

To chart histograms, one must import the library/module: matplotlib.pyplot (as plt)

Important functions in the matplotlib library:


- plt.hist(data, bins = XX, color = 'XXX') – creates a histogram (data is the dataset read in
earlier in the code and assigned to a variable).
- plt.title('name', fontsize = XX, color = 'XXX') – names the graph
- plt.xlabel('name', fontsize = XX, color = 'XXX') – labels the x-axis
- plt.ylabel('frequency', fontsize = XX, color = 'XXX') – labels the y-axis
- plt.show() – shows the histogram
- plt.savefig('name.png') – saves the plot as an image file (here a .png); call it before plt.show(), otherwise the saved image may be blank

Example:
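
A short sketch of the functions listed above, using invented data, colours and file name. The non-interactive "Agg" backend is selected only so the script also runs on a machine without a display; remove that line to see the plot window from plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to open a window
import matplotlib.pyplot as plt

# Hypothetical data set
data = [12, 15, 21, 22, 27, 30, 34, 38, 41, 45, 47, 52]

# plt.hist returns the bin counts, the bin edges and the drawn bars
counts, edges, _ = plt.hist(data, bins=4, color="steelblue", edgecolor="black")

plt.title("Histogram of Data", fontsize=14, color="black")
plt.xlabel("Value", fontsize=12, color="black")
plt.ylabel("Frequency", fontsize=12, color="black")

plt.savefig("histogram.png")  # save first, then show
plt.show()
```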

Summary
Week 3
Central Tendency

Measures of Central Tendency – (specifies where the data is centred), is a summary measure that
attempts to describe a whole set of data with a single value that represents the middle/centre of its
distribution.

Measures of Location – includes the measures of central tendency, but also other measures
that illustrate the location or distribution of the data.

Mean

Mean – also known as ‘average’. There are many different types of means:
❖ population,
❖ sample,
❖ weighted,
❖ geometric,
❖ quadratic,
❖ harmonic.
❖ The most common: ARITHMETIC mean - the sum of the data points, divided by the
number of the data points.

One can easily perform an average calculation with the function AVERAGE() in Excel.

One can graphically represent a mean as a fulcrum:


Characteristics of a mean:
- Uses all the data
- Can be skewed by outliers (large data values)
- Is the ‘balance point’

Median

Median – is a centre value that divides a data array into two halves. In an ordered array, the
median is the “middle” number (50% of the data is above the median, 50% is below).
Data array – data that has been arranged in numerical order.
The median is not affected by outlier values.

One can find the median with the MEDIAN() function in Excel.

Skewed and Symmetric Distributions

Symmetric Data – data sets whose values are evenly spread around the centre.

Skewed Data – data sets which are not symmetric.

Mode

Mode – the value with the highest frequency.


▪ Is not affected by outliers.
▪ Can be used for quantitative or qualitative data.
▪ Can have more than one, or no, mode.

Mode can be found with the MODE() function in Excel. There is MODE.MULT() which will return
all the modes, or MODE.SNGL() which will only return one mode.
Weighted Mean

Weighted Mean – the mean value of data values that have been weighted according to their
relative importance.

Example:
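
As a hedged illustration, a weighted mean can be computed as the sum of each value times its weight, divided by the sum of the weights. The marks and weights below are invented for the example.

```python
import numpy as np

# Hypothetical assessment marks and their weights (relative importance)
marks = [70, 80, 90]
weights = [0.2, 0.3, 0.5]

# Manual calculation: sum(w * x) / sum(w)
manual = sum(w * x for w, x in zip(weights, marks)) / sum(weights)

# Same calculation with NumPy
weighted_mean = np.average(marks, weights=weights)

print(manual, weighted_mean)  # 14 + 24 + 45 = 83
```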

Other Means

Geometric Mean – most frequently used to average rates of change over time or to compute the
growth rate of a variable.

Harmonic Mean – weighted mean in which each observation’s weight is inversely proportional to its
magnitude.

Python Using Pandas Library

Pandas has the mean(), median() and mode() functions.

To apply them to a document, one can use Pandas to read the document, then specify
which column/dataset one is calculating from, then print the calculated value.

Example:
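
A minimal sketch of the three functions, assuming a column named 'Data'. The values are typed in directly here instead of being read from a file.

```python
import pandas as pd

# Hypothetical data; normally read with pd.read_excel(...) or pd.read_csv(...)
data = pd.DataFrame({"Data": [2, 4, 4, 5, 7, 9, 11]})
column = data["Data"]

mean_value = column.mean()      # 42 / 7 = 6.0
median_value = column.median()  # middle of the 7 ordered values = 5.0
modes = column.mode().tolist()  # mode() returns every mode; here [4]

print("mean:", mean_value)
print("median:", median_value)
print("mode(s):", modes)
```

Note that mode() returns a Series because a data set can have more than one mode.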
Advantages and Disadvantages of Each Measure

Measures of Location: Percentiles & Quartiles

Calculating Percentiles

Steps:
1. Sort the data from lowest to highest.
2. Determine the percentile location index: $i = \frac{p}{100}(n)$
3. If $i$ is not an integer, round up to the next integer. The pth percentile is the value at
the rounded index position. If $i$ is an integer, the pth percentile is the average of the
values at index positions $i$ and $i + 1$.
Example:
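
The three steps can be sketched as a small function. The function name and data are invented for illustration, and positions are counted from 1 as in the steps above.

```python
import math


def percentile_by_location(values, p):
    """pth percentile using the location-index steps (valid for 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    data = sorted(values)            # Step 1: sort from lowest to highest
    n = len(data)
    i = p * n / 100                  # Step 2: location index i = (p / 100) * n
    if i != int(i):                  # Step 3a: not an integer, so round up
        position = math.ceil(i)
        return data[position - 1]    # -1 because Python lists start at 0
    position = int(i)                # Step 3b: integer, so average positions i and i + 1
    return (data[position - 1] + data[position]) / 2


# Hypothetical data set of 10 values
data = [3, 7, 8, 12, 15, 18, 21, 25, 30, 34]

p25 = percentile_by_location(data, 25)  # i = 2.5 -> position 3 -> 8
p50 = percentile_by_location(data, 50)  # i = 5   -> average of 15 and 18 = 16.5

print(p25, p50)
```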
Percentiles in Excel

Excel has two functions for any percentile:


- PERCENTILE.EXC([data_array],[percentage]) → Closer to the manual way of
interpolating middle number.
- PERCENTILE.INC([data_array],[percentage]) → Less close to the manual way of
interpolation.

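Excel has quartile functions as well (both EXC & INC):


- QUARTILE.EXC([data_array],[quart]), where quart is the quartile number (1, 2 or 3)
- QUARTILE.INC([data_array],[quart]), where quart is 0 to 4 (0 = minimum, 4 = maximum)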

Python (Pandas) .QUANTILE

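In Python’s Pandas library, the function .quantile(q=XX) is used to calculate a percentile, where q
is given as a fraction between 0 and 1 (e.g. q=0.25 for the 25th percentile). The interpolation
argument lets you specify the type of interpolation you desire (e.g. 'linear', 'lower', 'higher',
'midpoint', 'nearest').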

Python (Numpy) .QUANTILE and .PERCENTILE

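The Python library Numpy also has percentile functions. They are
numpy.quantile([data],[q]) and numpy.percentile([data],[percent]). The two do the same
calculation: .quantile takes q as a fraction between 0 and 1 (so q = 0.25, 0.5 and 0.75 give the
quartiles), while .percentile takes a percentage between 0 and 100.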

To utilise Numpy to analyse a dataset, one will still need to import the Pandas library to read the
file.
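
A brief sketch comparing the pandas and NumPy calls on the same invented data (the column name 'Data' is an assumption). With their default linear interpolation the results can differ from the manual location-index method.

```python
import numpy as np
import pandas as pd

# Hypothetical data; normally read from a file with pandas
data = pd.DataFrame({"Data": [3, 7, 8, 12, 15, 18, 21, 25, 30, 34]})
column = data["Data"]

q1_pandas = column.quantile(q=0.25)        # q as a fraction (0 to 1)
q1_numpy = np.quantile(column, 0.25)       # q as a fraction (0 to 1)
q1_percentile = np.percentile(column, 25)  # q as a percentage (0 to 100)

# Choosing a different interpolation type in pandas
q1_lower = column.quantile(q=0.25, interpolation="lower")

print(q1_pandas, q1_numpy, q1_percentile)  # all 9.0 with linear interpolation
print(q1_lower)                            # 8, the lower of the two neighbouring values
```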
Measures of Variance/Dispersion

Variation
Variation – a set of data exhibits variation if the data values are not all the same.

Measures of Variation – give information about the spread/variability/dispersion of the data. A
smaller value means less variation, a larger value means more variation.

Range

Range – a measure of variation that is computed by finding the difference between the
maximum and minimum values in a data set.
- The simplest measure of variation.
- Is very sensitive to extreme values/outliers.
- Ignores data distribution.

𝑅 = [𝑀𝑎𝑥 𝑉𝑎𝑙𝑢𝑒] − [𝑀𝑖𝑛 𝑉𝑎𝑙𝑢𝑒]

Interquartile Range – the range between the 1st and 3rd quartiles of data (the inner 50%). Used
when data is very spread, but most values are in the interquartile.
▪ Attempts to counteract some imbalance caused by extreme outlier values.
▪ Eliminates high- & low- valued observations.
𝐼𝑛𝑡𝑒𝑟𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑅𝑎𝑛𝑔𝑒 = 𝑄3 − 𝑄1
Population Variance

Population Variance – the average of the squared distances of the data values from the mean.
Shortcut formula:

X is each value in the dataset. So for the shortcut formula, it is the sum of the squared values ($\sum x^2$).
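
The formulas themselves are missing from these notes, so the sketch below assumes the usual textbook forms: the definition $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$ and the shortcut $\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$. The data values are invented.

```python
import numpy as np

# Hypothetical population of 8 values
x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
N = len(x)
mu = x.mean()  # 40 / 8 = 5.0

# Definition: average of the squared distances from the mean
variance_definition = ((x - mu) ** 2).sum() / N

# Shortcut (assumed textbook form): uses the sum of squared values and the squared sum
variance_shortcut = ((x ** 2).sum() - x.sum() ** 2 / N) / N

# NumPy's var() uses ddof=0 by default, i.e. the population variance
variance_numpy = np.var(x)

print(variance_definition, variance_shortcut, variance_numpy)  # all 4.0
print(np.sqrt(variance_definition))                            # population standard deviation 2.0
```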

Population Standard Deviation

Population Standard Deviation – the positive square root of the variance.


The most used measure of variation. Has the same units as the original data.
The formula is the square root of the population variance formula (or of its shortcut formula).

Sample Variance/Sample Standard Deviation

Sample Variance – variance but for the sample not the population.

Sample Standard Deviation – standard deviation but for the sample not the population.

Example:
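
A hedged sketch, assuming the usual sample formulas in which the sum of squared deviations is divided by n − 1 instead of n. The data values are invented.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical sample of 8 values
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)
x_bar = sum(sample) / n  # 5.0

# Manual sample variance: divide by n - 1
sample_variance = sum((x - x_bar) ** 2 for x in sample) / (n - 1)  # 32 / 7
sample_sd = math.sqrt(sample_variance)

# pandas uses ddof=1 (sample) by default; NumPy needs ddof=1 stated explicitly
pandas_variance = pd.Series(sample).var()
numpy_variance = np.var(sample, ddof=1)

print(sample_variance, pandas_variance, numpy_variance)
print(sample_sd)
```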
Standardised Data Values

Standardised Data Values – the number of standard deviations a value is from the mean. Can
sometimes be referred to as z-scores.

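Standardised Population Data:

$z = \frac{x - \mu}{\sigma}$ where: x = the data value, μ = the population mean, σ = the population standard deviation.

Standardised Sample Data:

$z = \frac{x - \bar{x}}{s}$ where: x = the data value, x̄ = the sample mean, s = the sample standard deviation.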

Standardised Data Values in Excel

One can use the STANDARDIZE() function to get a z-score calculated. You must provide the
mean and the standard deviation, as well as the data-point you wish to calculate with.
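
The same calculation as STANDARDIZE() can be sketched in Python. The function name z_score and the numbers are invented for the example.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


# Hypothetical values: a score of 85 where the mean is 70 and the standard deviation is 10
z = z_score(85, 70, 10)     # (85 - 70) / 10 = 1.5
z_low = z_score(55, 70, 10)  # below the mean, so negative: -1.5

print(z, z_low)
```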
Excel Measures of Variance

Range: is calculated with the MAX([data]) minus the MIN([data]).

Standard Deviation: is calculated with the STDEV() function. The Population Standard
Deviation is calculated with the function STDEV.P() and the Sample Standard Deviation is
calculated with the STDEV.S() function.

Variance: can be calculated with the VAR() function. The Population Variance is calculated with
the function VAR.P() and the Sample Variance is calculated with the VAR.S() function.

Comparing Data with the Standard Deviation

On its own, the mean could suggest that data sets are similar.
Only with more information can one make a proper judgement.

The standard deviation allows one to truly see the difference in how spread out the
data points/observations of each data set are.

The Mean and Standard Deviation Together

Coefficient of Variation (CV) – the ratio of the standard deviation to the mean, expressed as a
percentage. The coefficient of variation is used to measure variation relative to the mean.
▪ Measures relative variation.
▪ Always expressed as a percentage.
▪ Shows variation relative to the mean.
▪ Is used to compare two or more sets of data.
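
A small sketch of the coefficient of variation, here computed with the sample standard deviation. The two data sets and the function name are invented; they are on very different scales but have the same relative variation.

```python
import statistics


def coefficient_of_variation(values) -> float:
    """Sample standard deviation divided by the mean, as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical data sets
small_scale = [10, 12, 14]      # mean 12, standard deviation 2
large_scale = [100, 120, 140]   # mean 120, standard deviation 20

cv_small = coefficient_of_variation(small_scale)
cv_large = coefficient_of_variation(large_scale)

print(round(cv_small, 2), round(cv_large, 2))  # both about 16.67 (%)
```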

The Empirical Rule


If the data distribution is bell shaped (a bell curve), then the interval…
- mean ± 1 standard deviation contains approximately 68% of the data values.
- mean ± 2 standard deviations contains approximately 95% of the data values.
- mean ± 3 standard deviations contains approximately all (about 99.7%) of the data values.
Skewness

Skewness – a measure of how much data has strayed from a perfect bell curve distribution.
One can calculate skewness with the formula $\frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$ or with the Excel function
SKEW().

If a skewness value is positive: the distribution is positively skewed (skewed to the right).
If a skewness value is negative: the distribution is negatively skewed (skewed to the left).

Excel Data Analysis: Descriptive Statistics

In the data analysis tool, you can select ‘descriptive statistics’. You must input your data, and
then at least select ‘summary statistics’.

Almost all descriptive statistics will be outputted quickly and neatly.

Make sure to double check that the “count” figure is the same figure as the number of data
points you want to analyse. If it is wrong, then all the outputted figures will be wrong also.

Python Pandas: Descriptive Statistics

Range: numpy.ptp([data_array]) (a NumPy function)

Standard Deviation: Population: .std(ddof=0) → setting the delta degrees of freedom to 0.
Sample: .std()

Variance: Population: .var(ddof=0) → setting the delta degrees of freedom to 0.
Sample: .var()

Summary like data analysis: [column_heading].describe()
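
A minimal sketch of these calls on invented data, assuming a column named 'Data'.

```python
import numpy as np
import pandas as pd

# Hypothetical data; normally read from a file with pandas
data = pd.DataFrame({"Data": [2, 4, 4, 4, 5, 5, 7, 9]})
column = data["Data"]

data_range = np.ptp(column.to_numpy())  # max - min = 9 - 2 = 7

population_sd = column.std(ddof=0)      # population standard deviation = 2.0
sample_sd = column.std()                # sample standard deviation (ddof=1 by default)

population_var = column.var(ddof=0)     # population variance = 4.0
sample_var = column.var()               # sample variance = 32 / 7

summary = column.describe()             # count, mean, std, min, quartiles, max

print(data_range, population_sd, sample_sd, population_var, sample_var)
print(summary)
```

Note that the std reported by describe() is the sample standard deviation.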


Week 4
The Basics of Probability

Probability – the chance that a particular event will occur. A probability value will be in a range
from 0-1 (0%-100%). If there is a probability of 1 (100%), then there is a certainty that an event
will occur. A probability of 0 (0%) represents there being no chance.

Experiment – a process that produces a single outcome whose result cannot be predicted with
certainty.

Event – a collection of experimental outcomes.

Sample Space – the collection of all outcomes that can result from a selection, decision, or
experiment.
Examples:
- All 6 faces of a die,
- All 52 cards of a deck of cards.

Defining the Sample Space Application

For two sales:

One can also restrict the outcomes, for example: no double ups, in which case the outcomes
will differ slightly.
Using a Tree Diagram to Define the Sample Space

Step 1: Define the experiment.

Step 2: Define the outcomes for a single trial of the experiment: e.g. Yes, No

Step 3: Define the sample space for the number of trials using a tree diagram:

Venn Diagrams

Venn Euler diagrams can be used to express in graphical form the sample space and the event.

Sample space: the universal set.


Event/s: enclosed by a circle within the box.

Defining an Event of Interest

Step 1: Define the experiment.

Step 2: List the outcomes associated with one trial of the experiment.

Step 3: Define the sample space.

Step 4: Define the event of interest: a collection of the outcomes possible in the sample
space, often defined by a certain condition.
Types of Events

Mutually Exclusive Events – two events are mutually exclusive if the occurrence of one event
precludes the occurrence of the other event. If events have no outcomes in common, they are
said to be mutually exclusive.

Independent Events – two events are independent if the occurrence of one event in no way
influences the probability of the occurrence of the other event.

Dependent Events – two events are dependent if the occurrence of one event impacts the
probability of the other event occurring.

Assigning Probability

Classical Probability Assessment

Classical Probability Assessment – the method of determining probability based on the ratio
of the number of ways an outcome or event of interest can occur to the number of ways any
outcome or event can occur when the individual outcomes are equally likely.

Think of a fair coin. There are two possible outcomes. The probability of landing heads is ½.
Relative Frequency Probability Assessment

Relative Frequency Probability Assessment – the method that defines probability as the
number of times an event occurs divided by the total number of times the experiment is performed,
over a large number of trials.
This uses historical data.

Subjective Probability Assessment

Subjective Probability Assessment – the method that defines the probability of an event as
reflecting a decision maker’s state of mind regarding the chances that the particular event will
occur.
The subjective probability is a measure of a personal conviction that an event will occur,
representing a person’s belief that an event will occur.

There is no equation to determine subjective probability as it is drawn from one’s own
experience and beliefs.

Probability as Odds

Probabilities are often stated as odds (a ratio, e.g. 3:2) for or against a given event occurring.

The Rules of Probability


Possible Values and Summation Rules

Probability Rule 1:
The probability of an event occurring is always between 0 and 1.

Probability Rule 2:
The sum of the probabilities of all possible outcomes is 1.

Probability Rule 3: Addition Rule for Individual Outcomes


The probability of an event is equal to the sum of the probabilities of the individual outcomes that
form the event (the outcomes must be mutually exclusive for this rule).

Complement Rule
The complement of event E is the collection of all possible outcomes not contained in event E.
Complement Rule – the probability of an event not occurring is one minus the probability of it occurring: $P(\bar{E}) = 1 - P(E)$.

Probability Rule 4: Addition Rule for Any Two Events


The probability of one event or the other is the probability of one event plus the probability of
the other, minus the probability of the intersection of both: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$.
Probability Rule 5: Addition Rule for Mutually Exclusive Events
As both events cannot occur at the same time, the probability of either occurring is the sum of the
individual probabilities of each event: $P(A \text{ or } B) = P(A) + P(B)$.

Probability Rule 6: Conditional Probability for Any Two Events


Conditional Probability - the probability that an event will occur given that some other event
has already happened: $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$.
The | represents ‘given that’. In full, $P(A \mid B)$ reads ‘probability of A, given that B has already
occurred’.

Probability Rule 7: Conditional Probability for Independent Events


For independent events, the probability of one event occurring given that a second independent
event has already occurred is simply the probability of the first event occurring: $P(A \mid B) = P(A)$.

Probability Rule 8: Multiplication Rule for any Two Dependent Events


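The joint probability of two events is the probability of the first event multiplied by the conditional probability of the second given the first: $P(A \text{ and } B) = P(A)\,P(B \mid A)$. Use this when the probability of the two events occurring together cannot be found directly.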

Probability Rule 9: Multiplication Rule for Independent Events


The joint probability of two independent events is simply the product of the probabilities of the
two events: $P(A \text{ and } B) = P(A)\,P(B)$.
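
The rules can be checked on a standard 52-card deck. This sketch assumes the usual textbook forms of the rules (the formulas are not reproduced in these notes); the events chosen are only an illustration.

```python
from fractions import Fraction

# One card drawn from a standard 52-card deck
p_king = Fraction(4, 52)             # event A: the card is a king
p_heart = Fraction(13, 52)           # event B: the card is a heart
p_king_and_heart = Fraction(1, 52)   # the king of hearts

# Complement rule: P(not A) = 1 - P(A)
p_not_king = 1 - p_king

# Rule 4 (addition rule for any two events)
p_king_or_heart = p_king + p_heart - p_king_and_heart

# Rule 6 (conditional probability): P(A | B) = P(A and B) / P(B)
p_king_given_heart = p_king_and_heart / p_heart

# Rules 7 and 9: "king" and "heart" are independent, so
# P(A | B) = P(A) and P(A and B) = P(A) * P(B)
independent = (p_king_given_heart == p_king) and (p_king * p_heart == p_king_and_heart)

# Rule 8 (dependent events): two kings in a row without replacement
p_two_kings = Fraction(4, 52) * Fraction(3, 51)   # P(K1) * P(K2 | K1)

print(p_not_king, p_king_or_heart, p_king_given_heart, independent, p_two_kings)
```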
Week 5
Introduction to Probability Distributions

Random Variable – a variable that is subject to randomness, meaning it can take on different
values. It takes on different numerical values based on the chance of some event (e.g. events
from a random experiment).

Types of Random Variables

Discrete Random Variable – can only assume a finite number of values or an infinite sequence
of values (e.g. 0, 1, 2, 3…). Whole number.
There could be many different outcomes (e.g., number of complaints per day) or only
two possible outcomes (e.g. defective item: yes or no).

Continuous Random Variable – can assume an uncountably infinite number of values
(generally a range, can have decimals).

Expected Value (Mean)

One can calculate the mean (expected value) of a discrete random variable by taking the sum of each
value multiplied by its respective probability (relative frequency): $E(x) = \sum x\,P(x)$.

Standard Deviation

One can derive the standard deviation by taking the square root of the sum of [(each value minus
the mean) squared, times its probability (relative frequency)]: $\sigma_x = \sqrt{\sum (x - E(x))^2\,P(x)}$.
Python

For discrete variables, to find the mean, variance and standard deviation, one can use the
rv_discrete class from the scipy.stats Python library.

The scipy.stats library is quite large, so to be efficient, instead of importing the whole library,
one can just import the necessary function/s:

from scipy.stats import rv_discrete

The following functions then can be used to calculate the mean, variance and standard
deviation:
- XXXX.mean()
- XXXX.var()
- XXXX.std()

Here XXXX stands for whatever name you give your rv_discrete variable (e.g. discvar). You must
define it from the possible values and their probabilities.

Example (defining own variables):
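
A minimal sketch with invented values and probabilities (for instance, the number of complaints per day). The name discvar is just the variable name chosen here.

```python
from scipy.stats import rv_discrete

# Hypothetical discrete distribution: possible values and their probabilities
values = [0, 1, 2, 3]
probabilities = [0.1, 0.3, 0.4, 0.2]   # must sum to 1

# Define the discrete random variable from (values, probabilities)
discvar = rv_discrete(values=(values, probabilities))

mean = discvar.mean()      # 0*0.1 + 1*0.3 + 2*0.4 + 3*0.2 = 1.7
variance = discvar.var()   # E(x^2) - mean^2 = 3.7 - 2.89 = 0.81
std_dev = discvar.std()    # square root of the variance = 0.9

print(mean, variance, std_dev)
```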

Example (READING EXCEL FILE):


Week 6
Continuous Distribution

Continuous Distributions – random variables that can take values at every point over a given
interval.

The Normal Probability Distribution

Normal Probability Distribution – is a bell-shaped distribution with the following properties:


▪ Is unimodal – the normal distribution peaks at a single value.
▪ Is symmetrical – the areas under the curve between the mean and any two points at equal
distance on either side of the mean are identical.
▪ The mean, median and mode are all equal.

A normal probability distribution is asymptotic to the x-axis, and the amount of variation in the
random variable determines the height and spread of the normal distribution.

The function for normal probability density: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$


Finding Normal Probabilities

When X is a continuous random variable, P(X = x) = 0 for any particular value x.

The probability for a range of values between A and B is defined as the area under the curve
between the two points.

The Standard Normal Distribution

Standard Normal Distribution – a normal distribution that has a mean = 0 and a standard
deviation = 1.
The horizontal axis is scaled in z-values that measure the number of standard deviations
a point is from the mean. Values above the mean have positive z-values and below the
mean have negative z-values.

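Standardised Normal z-Value:


$z = \frac{x - \mu}{\sigma}$ where: x = any point on the horizontal axis, μ = the mean, σ = the standard deviation.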

Utilising the standardised normal z-value equation, one can convert any normal distribution
into a standard normal distribution.

Any normal distribution (with any mean and any standard deviation) can be scaled into the
standard normal distribution (z).
Any specified value, x, from the population distribution can be converted into a corresponding
z-value.

By converting to a standard normal distribution, one can use existing tables of the probabilities
associated with different z-values.
Probability Rules for Standard Normal Distribution Table

Here the table value for Z (Z > 0) is the area under the curve between the mean (z = 0) and Z.

$P(z \le Z) = 0.5 + \text{table value for } Z$

$P(z \le -Z) = 0.5 - \text{table value for } Z$

$P(z \ge Z) = 0.5 - \text{table value for } Z$

$P(z \ge -Z) = 0.5 + \text{table value for } Z$

$P(a \le z \le b) = P(z \le b) - P(z \le a)$


Finding x From Probability, Mean & Standard Deviation
Step 1. Subtract the given tail probability from 0.5 to get the area between the mean and x. For this example: 0.5 − 0.1 = 0.4
Step 2. Find the corresponding z-value in the standard table. ~0.4 → 1.28 (corresponding to 0.3997).
Step 3. Rearrange $z = \frac{x - \mu}{\sigma}$ to find x: $x = \mu + z\sigma$.

Finding μ From Probability, x and Standard Deviation


Step 1. Subtract the given tail probability from 0.5 to get the area between the mean and x.
For this example: 0.5 − 0.05 = 0.45
Step 2. Find the corresponding z-value in the standard table. 0.45 → 1.645
(corresponding to the value between 1.64 & 1.65 on the standard table).
Step 3. Rearrange $z = \frac{x - \mu}{\sigma}$ to find μ: $\mu = x - z\sigma$.
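
The two procedures can be sketched in Python, with norm.ppf standing in for the table lookup. The tail probabilities 0.10 and 0.05 come from the steps above; the means, standard deviations and x-value are invented, and the sketch assumes the tail is the upper (right-hand) one.

```python
from scipy.stats import norm

# Finding x: upper-tail probability 0.10, hypothetical mean 50 and standard deviation 5
tail = 0.10
z = norm.ppf(1 - tail)        # z-value with 90% of the area below it, about 1.28
mu, sigma = 50, 5
x = mu + z * sigma            # rearranged from z = (x - mu) / sigma

# Finding the mean: upper-tail probability 0.05, hypothetical x = 60 and standard deviation 4
tail_2 = 0.05
z_2 = norm.ppf(1 - tail_2)    # about 1.645
x_2, sigma_2 = 60, 4
mu_2 = x_2 - z_2 * sigma_2    # rearranged from z = (x - mu) / sigma

print(round(z, 2), round(x, 2))
print(round(z_2, 3), round(mu_2, 2))
```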

Excel and Normal Distributions

Excel has four different functions which can be utilised when dealing with normal distributions:
- NORM.S.DIST(z, cumulative)
normal standard distribution
cumulative = TRUE/FALSE
TRUE = returns the cumulative probability P(Z ≤ z)
FALSE = returns the height of the density curve at z (the point on the
distribution graph line)
- NORM.S.INV(probability)
normal standard distribution inverse
Returns the z-value for the given probability.
- NORM.DIST(x, mean, std_Dev, cumulative)
normal distribution (WHEN: the z-value is unknown)
- NORM.INV(probability, mean, std_Dev)
Returns the x-value for the given probability.

Python and Normal Distributions

Functions to deal with normal distributions can be found in the scipy.stats library. One can
import scipy.stats as st.

o st.norm.cdf(x, mean, sd) – .cdf meaning Cumulative Distribution Function.
Returns the probability of the random variable being less than or equal to x, for a
normal distribution with the specified mean and standard deviation.
o st.norm.sf(x, mean, sd) – .sf meaning Survival Function. Returns the probability of the
random variable being greater than x (i.e. 1 − cdf), for a normal distribution with the
specified mean and standard deviation.
Excel/Python Examples
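
A short sketch of cdf and sf with an invented mean of 100 and standard deviation of 15. The Excel calls in the comments are the equivalent NORM.DIST and NORM.S.DIST formulas.

```python
import scipy.stats as st

mean, sd = 100, 15   # hypothetical normal distribution

# P(X <= 115); Excel: =NORM.DIST(115, 100, 15, TRUE)
p_below = st.norm.cdf(115, mean, sd)

# P(X > 115); Excel: =1 - NORM.DIST(115, 100, 15, TRUE)
p_above = st.norm.sf(115, mean, sd)

# P(85 <= X <= 115) = P(X <= 115) - P(X <= 85), i.e. within 1 standard deviation
p_between = st.norm.cdf(115, mean, sd) - st.norm.cdf(85, mean, sd)

# Standard normal (mean 0, sd 1 are the defaults); Excel: =NORM.S.DIST(1.28, TRUE)
# Subtracting 0.5 gives the "table value" (area between the mean and z)
table_value = st.norm.cdf(1.28) - 0.5

print(round(p_below, 4), round(p_above, 4), round(p_between, 4), round(table_value, 4))
```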
Further Python and Normal Distributions

To find the x-value for any probability under a normal distribution, one can continue to use
functions of the norm object from the scipy.stats Python library.

One can either import the entire library (import scipy.stats as st) or import just norm
(from scipy.stats import norm).

o norm.ppf(p, mean, sd) – Returns the x-value for which the probability of x being below
that value is P, for a normal distribution with the specified mean and standard deviation
(specify a mean of 0 and a standard deviation of 1 to present as a standard normal
distribution).
o norm.isf(p, mean, sd) - Returns the x-value for which the probability of x being above
that value is P, for a normal distribution with the specified mean and standard deviation
(specify a mean of 0 and a standard deviation of 1 to present as a standard normal
distribution).

Inverse Calculation Example (Excel & Python)
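
A minimal sketch of ppf and isf with an invented mean of 100 and standard deviation of 15. The Excel calls in the comments are the equivalent NORM.INV and NORM.S.INV formulas.

```python
from scipy.stats import norm

mean, sd = 100, 15   # hypothetical normal distribution

# x-value with 90% of the distribution below it; Excel: =NORM.INV(0.90, 100, 15)
x_below = norm.ppf(0.90, mean, sd)

# x-value with 10% of the distribution above it: the same point
x_above = norm.isf(0.10, mean, sd)

# z-value for the standard normal distribution; Excel: =NORM.S.INV(0.90)
z = norm.ppf(0.90)

# Converting the z-value back to x gives the same answer: x = mean + z * sd
x_from_z = mean + z * sd

print(round(x_below, 2), round(x_above, 2), round(z, 4), round(x_from_z, 2))
```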

Other Continuous Probability Distributions

Uniform Probability Distribution – (all numbers have the same probability) distribution of little
information: the probability over any interval of the continuous random variable is the same as
for any other interval of the same width.

Exponential Probability Distribution – used to measure the time that elapses between two
occurrences of an event.
Week 7
What is Sampling?

Sampling – is a process used in statistical analysis in which a predetermined number of
observations are taken from a larger population.
Why take a sample? Because it is generally expensive, time consuming and complicated
to measure a whole population.

Statistical Sampling

Statistical Sampling – where items of the sample are chosen based on known or calculable
probabilities.
Simple Random Sampling – where every possible sample of a given size has an equal
chance of being selected (can be selected using a table of random numbers).
Stratified Random Sampling – where one divides the population into subgroups (called
strata), according to some common characteristic, then selects a simple random
sample from each subgroup, then combines the samples from subgroups into one.

Systematic Random Sampling – where one decides the sample size, divides the population
into that number of equally sized subgroups, randomly selects an individual from the
first group, then selects the individual in the same position in every subsequent
group.

Cluster Sampling – when one divides a population into several “clusters”, each
representative of the population, then selects a simple random sample of clusters to be
the overall sample.
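
The four sampling methods can be sketched with Python's built-in random module. Everything here is assumed for illustration: the population is simply the ID numbers 1 to 100, the strata are the lower and upper halves, and the clusters are blocks of 10 consecutive IDs.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

population = list(range(1, 101))  # hypothetical population of 100 ID numbers
n = 10                            # desired sample size

# Simple random sampling: every sample of size n is equally likely
simple = random.sample(population, n)

# Systematic random sampling: n groups of size k, random position in the
# first group, then the same position in every subsequent group
k = len(population) // n
start = random.randrange(k)
systematic = population[start::k]

# Stratified random sampling: a simple random sample from each stratum, combined
strata = {"low": population[:50], "high": population[50:]}
stratified = []
for members in strata.values():
    stratified.extend(random.sample(members, n // 2))

# Cluster sampling: randomly select whole clusters
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [item for cluster in chosen_clusters for item in cluster]

print(simple, systematic, stratified, cluster_sample, sep="\n")
```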
Sampling Error: What it is and Why it Happens

A sample is likely not a perfect representation of a population.

Sampling Error – the difference between a measure computed from a sample (a statistic), and
the corresponding measure computed from the population (a parameter).
The size of a sampling error depends on the selected sample, and a sampling error
could occur as a negative or positive value.

To calculate the sampling error, one must calculate the population mean and the sample
mean. These are simple calculations:
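
A minimal sketch of the calculation, assuming a tiny made-up population of six values and one arbitrary sample of three of them:

```python
from statistics import mean

# Hypothetical population and one possible sample drawn from it
population = [4, 8, 15, 16, 23, 42]
sample = [8, 16, 42]

mu = mean(population)   # population mean (the parameter)
x_bar = mean(sample)    # sample mean (the statistic)

# Sampling error = statistic - parameter (can be positive or negative)
sampling_error = x_bar - mu
print(mu, x_bar, sampling_error)
```

A different sample from the same population would generally give a different sampling error.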

To calculate the number of possible samples of a given size, one need only use the following
equation: $\frac{n!}{x!(n-x)!}$, where x is the sample size and n is the population size.
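
The same count can be obtained in Python with math.comb. The sketch below assumes a made-up population of 6 items and a sample size of 3, and checks the result against the factorial form of the equation:

```python
import math

n = 6  # population size (hypothetical)
x = 3  # sample size (hypothetical)

# Built-in combinations function
possible_samples = math.comb(n, x)

# The same value from the factorial equation
by_formula = math.factorial(n) // (math.factorial(x) * math.factorial(n - x))

print(possible_samples, by_formula)
```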
Average Value of Sample Means

THEOREM 1:
For any population, the average value of all possible sample means, computed from all
possible random samples of a given size from the population, will equal the population
mean.
$\mu_{\bar{x}} = \mu$
Unbiased Estimator – a characteristic of certain statistics in which the average of
all possible values of the sample statistic equals the parameter.
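
The first theorem can be checked by brute force on a small population. The sketch below assumes a made-up population of six values, lists every possible sample of size 3, and uses exact fractions so that no rounding error enters the comparison.

```python
from fractions import Fraction
from itertools import combinations

population = [4, 8, 15, 16, 23, 42]  # hypothetical population
size = 3                             # sample size

# Mean of every possible sample of the given size
sample_means = [Fraction(sum(s), size) for s in combinations(population, size)]

# Average of all possible sample means vs. the population mean
average_of_means = sum(sample_means) / len(sample_means)
mu = Fraction(sum(population), len(population))

print(len(sample_means), average_of_means, mu)
```

Individual sample means differ from the population mean, but their average matches it exactly, which is what makes the sample mean an unbiased estimator.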

Standard Deviation of Sample Means

THEOREM 2:
For any population, the standard deviation of the possible sample means, computed
from all possible random samples of size n, is equal to the population
standard deviation divided by the square root of the sample size (also called the standard
error).
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
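
The second theorem can be checked in the same brute-force way. This sketch assumes a made-up population of six values and, importantly, assumes sampling with replacement (each draw is independent). The notes do not state this condition; when sampling without replacement from a small population, a finite population correction would be needed.

```python
import math
from itertools import product
from statistics import pstdev

population = [4, 8, 15, 16, 23, 42]  # hypothetical population
n = 2                                # sample size

# Mean of every possible ordered sample of size n, drawn WITH replacement
sample_means = [sum(s) / n for s in product(population, repeat=n)]

# Standard deviation of all possible sample means
sd_of_means = pstdev(sample_means)

# Population standard deviation divided by the square root of n
sigma = pstdev(population)
standard_error = sigma / math.sqrt(n)

print(sd_of_means, standard_error)
```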
Sampling from the Normal Populations

THEOREM 3:
If a population is normally distributed (mean $\mu$ and standard deviation $\sigma$), the sampling
distribution of the sample mean ($\bar{x}$) will also be normally distributed, with a mean equal
to the population mean ($\mu_{\bar{x}} = \mu$) and a standard deviation equal to the population
standard deviation divided by the square root of the sample size ($\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$).
The Central Limit Theorem

A population may not be normally distributed.

THEOREM 4:
For simple random samples of n observations taken from a population with mean $\mu$ and
standard deviation $\sigma$, regardless of the population’s distribution, provided the sample
size is sufficiently large, the distribution of the sample means, $\bar{x}$, will be
approximately normal with a mean equal to the population mean ($\mu_{\bar{x}} = \mu$) and a
standard deviation equal to the population standard deviation divided by the square root
of the sample size ($\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$).
The larger the sample size, the better the
approximation to the normal distribution.

The sample size must be sufficiently large. If it is not, then the distribution of the
sample means may itself be skewed, and the normal distribution will be a poor approximation
of it.

For a population which is quite symmetric, a sample size as small as 2 or 3 may be
sufficient. However, if a population is largely skewed, or otherwise irregularly
shaped, the sample size must be larger.

A conservative definition of a “sufficiently large” sample size is: n ≥ 30
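
The Central Limit Theorem can be illustrated with a quick simulation. The sketch below assumes a strongly skewed population (an exponential distribution with mean 2 and standard deviation 2, chosen only for illustration) and draws many samples of size 30 with numpy.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n = 30               # sample size
num_samples = 20000  # number of simulated samples

# Skewed population: exponential with mean 2 (its standard deviation is also 2)
samples = rng.exponential(scale=2.0, size=(num_samples, n))

# One sample mean per simulated sample
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()  # should be close to the population mean
sd_of_means = sample_means.std()     # should be close to sigma / sqrt(n)

print(mean_of_means, sd_of_means, 2.0 / np.sqrt(n))
```

Plotting sample_means with matplotlib.pyplot.hist() should show a roughly bell-shaped histogram even though the population itself is skewed.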

Z-Value for Sampling Distribution of $\bar{x}$

The relative distance that a given sample mean is from the centre can be determined by
standardising the sampling distribution.
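
Standardising a sample mean uses the standard error in place of the population standard deviation: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$. A minimal sketch, assuming made-up values for the population mean, population standard deviation, sample size and sample mean:

```python
import math
from scipy.stats import norm

# Hypothetical values (assumed for illustration only)
mu = 50      # population mean
sigma = 12   # population standard deviation
n = 36       # sample size
x_bar = 53   # observed sample mean

# Standard error of the sample mean
standard_error = sigma / math.sqrt(n)

# z-value: number of standard errors the sample mean is from the centre
z = (x_bar - mu) / standard_error

# Probability of a sample mean at least this large
p_above = norm.sf(z)

print(z, p_above)
```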
Point and Confidence Interval Estimates for a Population Mean

Point Estimate – a single statistic, determined from a sample, that is used to estimate the
corresponding population parameter.

Sampling Error – the difference between a measure computed from a sample (a statistic),
and the corresponding measure computed from the population (a parameter).

Confidence Interval – an interval developed from sample values such that if all possible
intervals of a given width are constructed, a percentage of these intervals, known as the
confidence level, will include the true population parameter.
Utilised to account for sampling error.
Interval estimate – an interval, or range of values, used to estimate a population
parameter.

Confidence Interval

Confidence Level – the percentage of all possible confidence intervals that will contain the true
population mean.

There are multiple scenarios, and hence multiple methods, for constructing a confidence
interval at a given confidence level.

METHOD 1: Confidence interval for the population mean where population standard deviation
is known.
Case 1 – the simple random sample is drawn from a normal distribution.
Case 2 – the population does not have a normal distribution, or the distribution
of the population is unknown (the sample size must then be sufficiently large).
Both times, the sampling distribution for the sample mean is assumed to be normally
distributed.
The z-value (standardised value) for this calculation is known as the “critical value”.

95% of $\bar{x}$ values will fall within $\pm 1.96\sigma_{\bar{x}}$ of the population mean. Thus, 95% of the confidence
intervals will include the population mean.

Critical values (z-values (standardised values)) can be found utilising the standard normal table,
or the NORM.S.INV() function in Excel.

Margin of Error – the amount that is added or subtracted from the point
estimate to determine the endpoints of the confidence interval. Can be reduced
by increasing the sample size.
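
A minimal sketch of Method 1 in Python, assuming made-up values for the sample mean, the known population standard deviation and the sample size. norm.ppf() plays the role of NORM.S.INV() here, and the interval is the point estimate plus or minus the margin of error, $\bar{x} \pm z\frac{\sigma}{\sqrt{n}}$.

```python
import math
from scipy.stats import norm

# Hypothetical values (assumed for illustration only)
x_bar = 53    # sample mean (point estimate)
sigma = 12    # known population standard deviation
n = 36        # sample size
confidence = 0.95

# Critical value: leaves alpha/2 in each tail of the standard normal
alpha = 1 - confidence
z_crit = norm.ppf(1 - alpha / 2)

# Margin of error and interval endpoints
margin = z_crit * sigma / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(z_crit, margin, (lower, upper))
```

Re-running the sketch with a larger n gives a smaller margin of error, as noted above.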
METHOD 2: Confidence interval for the population mean where population standard deviation
is unknown.
In most cases, if the population mean is unknown, so is the population standard deviation. In
such cases, more uncertainty is introduced, and the estimation process must be modified to
account for such.
When the population standard deviation is unknown, the critical value is a t-value, taken from a
family of distributions known as the Student’s t-Distributions.

Student’s t-Distributions – a family of distributions that are bell-shaped and symmetric like the
standard normal distribution but with greater area in the tails. Each distribution in the t-family is
defined by its degrees of freedom. As the degrees of freedom increase, the t-distribution
approaches the normal distribution.
Degrees of Freedom – the number of independent data values available to estimate the
population’s standard deviation. If k-parameters must be estimated before the
population’s standard deviation can be calculated from a sample size n, the degrees of
freedom are equal to 𝑛 − 𝑘 (k assumed to be 1).

The t-distribution is based on the assumption that the population is normally distributed.
However, as long as the population is reasonably symmetric, one can utilise the t-distribution.
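
The effect of the degrees of freedom can be seen by comparing critical values. The sketch below assumes a 95% confidence level and three arbitrary degrees-of-freedom values; t.ppf() is the t-distribution counterpart of norm.ppf().

```python
from scipy.stats import norm, t

# 95% confidence leaves 2.5% in each tail, so look up the 0.975 point
z_crit = norm.ppf(0.975)

# t critical values for a few (arbitrarily chosen) degrees of freedom
t_crit = {df: t.ppf(0.975, df) for df in (5, 30, 1000)}

print("z:", z_crit)
for df, value in t_crit.items():
    print("t with", df, "degrees of freedom:", value)
```

The t-values are always larger than the z-value (more area in the tails) and shrink towards it as the degrees of freedom increase.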

Interpreting the Results of Confidence Interval Calculations


One must never state that there is a “probability” that the population mean is within a
particular calculated confidence interval.

The population mean is a fixed number that is either within the interval or not. Hence, one
must word the statement in terms of confidence rather than probability, e.g. “we are 95%
confident that the population mean lies between the lower and upper limits”.
Confidence Levels/Intervals in Excel

There are multiple methods in Excel.

Data Analysis Method:


Go to the Data Analysis tool and select “Descriptive Statistics”. You can input the data
range, and then specify what confidence level you want.
The tool will return the Margin of Error value (using the appropriate t-value).

Example:

Excel Functions CONFIDENCE.NORM() and CONFIDENCE.T()

Excel offers functionality to return the Margin of Error value utilising the t-values or regular
normal distribution values.
Syntax:
CONFIDENCE.NORM([alpha], [standard deviation], [size])
CONFIDENCE.T([alpha], [standard deviation], [size])
Where:
alpha = 1 − [confidence level]
standard deviation = relevant standard deviation (population vs sample)
size = sample size

Confidence Level/Interval with Python

Utilising the scipy.stats Python Library

When utilising z-value:


scipy.stats.norm.interval([confidence level decimal], [mean], [population standard error])

When utilising t-value:


scipy.stats.t.interval([confidence level decimal], [degrees of freedom], [mean], [sample
standard error])
Standard Error = $\frac{\sigma}{\sqrt{n}}$

To determine the standard error, one must utilise the math library to calculate the square root of
the sample size (math.sqrt([sample size])).
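
A minimal sketch of both calls, assuming made-up values for the sample mean, the standard deviations and the sample size. Note that norm.interval() takes the confidence level, the mean and the standard error, while t.interval() additionally takes the degrees of freedom (n − 1) as its second argument.

```python
import math
from scipy.stats import norm, t

# Hypothetical values (assumed for illustration only)
x_bar = 53   # sample mean
n = 36       # sample size
sigma = 12   # population standard deviation (used when it is known)
s = 12       # sample standard deviation (used when sigma is unknown)

# z-based interval: population standard deviation known
se_population = sigma / math.sqrt(n)
z_low, z_high = norm.interval(0.95, loc=x_bar, scale=se_population)

# t-based interval: population standard deviation unknown
se_sample = s / math.sqrt(n)
t_low, t_high = t.interval(0.95, n - 1, loc=x_bar, scale=se_sample)

print((z_low, z_high))
print((t_low, t_high))
```

With the same standard error, the t-based interval is slightly wider than the z-based one, reflecting the extra uncertainty from estimating the standard deviation.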

Example:
Sample Size

On occasion, one may desire to determine the optimal sample size for a given confidence
level/margin of error. To do so, one must only rearrange the margin of error equation.
$e = z\frac{\sigma}{\sqrt{n}} \rightarrow n = \left(\frac{z\sigma}{e}\right)^2$
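
A minimal sketch of the rearranged equation, assuming a made-up population standard deviation of 12, a desired margin of error of 2 and a 95% confidence level. The result is rounded up, since rounding down would give a margin of error slightly larger than desired.

```python
import math
from scipy.stats import norm

# Hypothetical inputs (assumed for illustration only)
sigma = 12         # population standard deviation
e = 2              # desired margin of error
confidence = 0.95

# Critical z-value for the chosen confidence level
z = norm.ppf(1 - (1 - confidence) / 2)

# n = (z * sigma / e)^2, rounded up to the next whole observation
n_exact = (z * sigma / e) ** 2
n_required = math.ceil(n_exact)

print(n_exact, n_required)
```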

Example:
