FIN10002 - Notes Master
Excel:
- FREQUENCY()
- COUNTIF()
- VSTACK()
- AVERAGE()
- MEDIAN()
- MODE()
- PERCENTILE()
- QUARTILE()
- MAX()
- MIN()
- STDEV()
- VAR()
- SKEW()
- STANDARDIZE()
- NORM.S.DIST()
- NORM.S.INV()
- NORM.DIST()
- NORM.INV()
- CONFIDENCE.NORM()
- CONFIDENCE.T()
Python:
- pandas.value_counts()
- matplotlib.pyplot.hist()
- matplotlib.pyplot.show()
- matplotlib.pyplot.savefig()
- pandas/numpy.mean()
- pandas/numpy.median()
- pandas/numpy.mode()
- pandas.quantile()
- numpy.quantile()
- numpy.percentile()
- pandas.std()
- pandas.var()
- pandas.describe()
- numpy.ptp()
- scipy.stats.rv_discrete.mean()
- scipy.stats.rv_discrete.var()
- scipy.stats.rv_discrete.std()
- scipy.stats.norm.cdf()
- scipy.stats.norm.sf()
- scipy.stats.norm.ppf()
- scipy.stats.norm.isf()
- scipy.stats.norm.interval()
- scipy.stats.t.interval()
- math.sqrt()
Equations:
LECTURE NOTES WEEK 1
Introduction To Statistics
The success or failure of a business hinges largely on the decisions made by the people within
it. Information is integral to decision making, and information is often the outcome of data
collection, analysis, interpretation and reporting.
Data is part of the equation of information, but it is not information in and of itself. To be useful,
data must be gathered, processed, stored, manipulated, and tested, using various statistical
methods.
Financial Statistics – a collection of procedures and techniques used to convert data into
meaningful information in a financial (business) environment.
Statistics – is a mathematical science concerned with data collection, analysis,
interpretation, explanation and presentation.
Statistical Procedures
Inferential – tools and techniques that help decision makers to draw inferences from a set of
data.
▪ Estimation (e.g. estimating the population mean
weight using the sample mean weight)
▪ Hypothesis testing (e.g. using sample evidence
to test the claim that the population mean weight
is 75kg)
Procedures for Collecting Data
Population VS Sample
Experiment – a process that produces a single outcome whose result cannot be predicted with
certainty.
Experimental Design – a plan for performing an experiment in which the variable of
interest is defined.
Direct Observations – data being collected is physically observed and recorded based on what
takes place in the process.
Subjective and time consuming.
Personal Interviews – can be structured (questions are scripted) or unstructured (begin with
one or more broadly stated questions, with further questions being based on responses).
Data Timing
Cross-sectional Data – data that is collected at a fixed point in time.
E.g.: Businesses often conduct surveys to gauge consumer sentiment about a new
product. This captures data at a single point in time.
Qualitative – data whose measurement scale is inherently categorical (e.g. marital status,
political affiliation, eye colour).
Ratio Data – data that has all the characteristics of interval data
but has a true zero value/point.
Examples: weight, time, pay rate per hour, interest rates.
Week 2
Frequency Distributions & Histograms
Frequency Distribution – a summary of data presented in the form of class intervals and their
corresponding frequency.
A set of data that displays the number of observations in each of the distribution’s
distinct categories/classes. Is a list or table which contains the values of a variable (or
set of ranges within which the data falls). Contains the frequencies of with which each
value occurs.
Un-grouped Data – (raw data), is data which has not been summarised in any
way.
Grouped Data – is data which has been organised into a frequency distribution.
Discrete Data – data that can take on a countable number of possible values (e.g. student ages,
amazon product categories, etc.).
Continuous Data – data whose possible values are uncountable and that may assume any
value in an interval (weight, length, time, etc.).
Relative Frequency
Relative Frequency – the proportion of total observations that are in a given category.
Relative Frequency = fᵢ / n
In simple terms: frequency of values ÷ total number of observations.
fᵢ = frequency of the ith value of the discrete variable.
n = total number of observations.
Frequency Function
Syntax:
=FREQUENCY([range/array of data],[bins/class array])
CountIf Function
Syntax:
=COUNTIF([range/array of data],[criteria])
There are certain criteria which must be followed when building classes/bins:
- Classes must be mutually exclusive (not overlapping).
- Classes must be all-inclusive (a set of classes should include all possible data values).
- Classes should be of equal width, if possible (the distance between the lowest and
highest values of each class should be equal).
- Empty classes should be avoided.
2^k ≥ n RULE
Where k is the number of classes and is defined as the smallest integer so that 2^k ≥ n, where n
is the number of data values.
Cumulative Frequency Distribution – a summary of a set of data that displays the number of
observations with values less than or equal to the upper limit of each class.
Cumulative Relative Frequency Distribution – a summary of a set of data that displays the
proportion of observations with values less than or equal to the upper limit of each class.
The FREQUENCY function, pivot tables and the Data Analysis tool all work with cumulative distributions as well.
Python Frequency Analysis
[Name].value_counts()
Grouping data is very simple in Python. You can just specify how many bins you desire. It is
important, though, that you name the column which will hold the bins (in this example:
data['Data'].value_counts()).
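A minimal sketch of binned counts with pandas; the values and the column name 'Data' are made up for illustration:

```python
import pandas as pd

# made-up values standing in for a real dataset
data = pd.DataFrame({'Data': [12, 15, 21, 22, 25, 31, 34, 35, 41, 44]})

# count observations in 4 equal-width bins; sort=False keeps the bins in order
counts = data['Data'].value_counts(bins=4, sort=False)
print(counts)
```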
Histograms
Histogram – a graph of frequency distribution with the horizontal axis showing the classes and
the vertical axis showing the frequency counts (a visual representation of a frequency
distribution). Important: no gaps between the bar in the graph.
To chart histograms, one must import the library/module: matplotlib.pyplot (as plt)
Example:
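A minimal sketch of a histogram with matplotlib.pyplot (the data values are made up); hist() bins the data and draws the bars with no gaps between them:

```python
import matplotlib
matplotlib.use('Agg')  # draw without a display (safe in scripts)
import matplotlib.pyplot as plt

values = [12, 15, 21, 22, 25, 31, 34, 35, 41, 44]

counts, bin_edges, patches = plt.hist(values, bins=4, edgecolor='black')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.savefig('histogram.png')  # or plt.show() in an interactive session
```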
Summary
Week 3
Central Tendency
Measures of Central Tendency – (specifies where the data is centred), is a summary measure
that attempts to describe a whole set of data with a single value that represents the
middle/centre of its distribution.
Measures of Location – includes the measures of central tendency, but also other measures
that illustrate the location or distribution of the data.
Mean
Mean – also known as ‘average’. There are many different types of means:
❖ population,
❖ sample,
❖ weighted,
❖ geometric,
❖ quadratic,
❖ harmonic.
❖ The most common: ARITHMETIC mean - the sum of the data points, divided by the
number of the data points.
One can easily perform an average calculation with the function AVERAGE() in Excel.
Median
Median – is a centre value that divides a data array into two halves. In an ordered array, the
median is the “middle” number (50% of the data is above the median, 50% is below).
Data array – data that has been arranged in numerical order.
The median is not affected by outlier values.
One can find the median with the MEDIAN() function in Excel.
Symmetric Data – data sets whose values are evenly spread around the centre.
Mode
Mode can be found with the MODE() function in Excel. There is MODE.MULT() which will return
all the modes, or MODE.SNGL() which will only return one mode.
Weighted Mean
Weighted Mean – the mean value of data values that have been weighted according to their
relative importance.
Example:
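The same idea can be sketched in Python with numpy.average; the marks and weights below are made up for illustration:

```python
import numpy as np

marks = [80, 65, 90]        # hypothetical assessment marks
weights = [0.2, 0.3, 0.5]   # relative importance of each mark (sum to 1)

# weighted mean = sum(mark * weight) / sum(weights)
print(np.average(marks, weights=weights))  # 80*0.2 + 65*0.3 + 90*0.5 = 80.5
```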
Other Means
Geometric Mean – most frequently used to average rates of change over time or to compute the
growth rate of a variable.
Harmonic Mean – weighted mean in which observation’s weight is inversely proportional to its
magnitude.
To apply it to a document, one can use Pandas to read the document, then one must specify
which column/dataset one is calculating from, then one can PRINT, the calculated value.
Example:
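A sketch of that workflow; in practice one would read a file with pd.read_csv(), but a small inline DataFrame with a hypothetical 'Weight' column stands in here:

```python
import pandas as pd

# in practice: df = pd.read_csv('data.csv'); inline stand-in for illustration
df = pd.DataFrame({'Weight': [60, 62, 62, 70, 75, 80]})

print(df['Weight'].mean())     # arithmetic mean
print(df['Weight'].median())   # middle value of the ordered data
print(df['Weight'].mode()[0])  # most frequent value (mode() can return several)
```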
Advantages and Disadvantages of Each Measure
Calculating Percentiles
Steps:
1. Sort the data from lowest to highest.
2. Determine the percentile location index: i = (p/100) × n
3. If 𝑖 is not an integer, then round to the next highest integer. The pth percentile is located at
the rounded index position. If 𝑖 is an integer, the pth percentile is the average of the
values at location index positions 𝑖 and 𝑖 + 1.
Example:
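The steps above can be sketched as a small Python function (the data values are made up):

```python
import math

def percentile_location(data, p):
    """Locate the pth percentile using the manual index method above."""
    ordered = sorted(data)                 # step 1: sort low to high
    n = len(ordered)
    i = (p / 100) * n                      # step 2: location index
    if i != int(i):
        # step 3: not an integer -> round up; percentile sits at that position
        return ordered[math.ceil(i) - 1]   # -1 adjusts for zero-based indexing
    # step 3: integer -> average the values at positions i and i + 1
    return (ordered[int(i) - 1] + ordered[int(i)]) / 2

print(percentile_location([2, 4, 6, 8, 10], 50))  # i = 2.5 -> 3rd value -> 6
```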
Percentiles in Excel
In Python's Pandas library, the function .quantile(q=XX) is used to calculate the percentile. There
are also many ways you can specify the type of interpolation you desire.
The Python library Numpy also has percentile functions. They are
numpy.quantile([data],[fraction]) and numpy.percentile([data],[percent]). The .quantile function
expects the percentile as a fraction between 0 and 1 (like Excel's QUARTILE arguments), while
.percentile expects a percentage between 0 and 100.
To utilise Numpy to analyse a dataset, one will still need to import the Pandas library to read the
file.
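The three functions can be compared side by side; a minimal sketch with made-up values (all three use linear interpolation by default, so they agree):

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(pd.Series(values).quantile(q=0.9))  # pandas: percentile as a fraction
print(np.quantile(values, 0.9))           # numpy: fraction between 0 and 1
print(np.percentile(values, 90))          # numpy: percentage between 0 and 100
```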
Measures of Variance/Dispersion
Variation
Variation – a set of data exhibits variation if all/any of the data is not the same value.
Range
Range – a measure of variation that is computed by finding the difference between the
maximum and minimum values in a data set.
- The simplest measure of variation.
- Is very sensitive to extreme values/outliers.
- Ignores data distribution.
Interquartile Range – the range between the 1st and 3rd quartiles of data (the inner 50%). Used
when data is very spread, but most values are in the interquartile.
▪ Attempts to counteract some imbalance caused by extreme outlier values.
▪ Eliminates high- & low- valued observations.
Interquartile Range = Q3 − Q1
Population Variance
Population Variance – the average of the squared distances of the data values from the mean.
Shortcut formula:
X is each value in the dataset. So for the shortcut formula, Σx² is the sum of the squared values.
Sample Variance – variance but for the sample not the population.
Sample Standard Deviation – standard deviation but for the sample not the population.
Example:
Standardised Data Values
Standardised Data Values – the number of standard deviations a value is from the mean. Can
sometimes be referred to as z-scores.
One can use the STANDARDIZE() function to get a z-score calculated. You must provide the
mean and the standard deviation, as well as the data-point you wish to calculate with.
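The same calculation can be sketched in Python with a small helper mirroring Excel's STANDARDIZE(x, mean, standard_dev) arguments (the numbers below are made up):

```python
# z-score: how many standard deviations x sits from the mean
def standardize(x, mean, std_dev):
    return (x - mean) / std_dev

# a value of 85 in a distribution with mean 75 and standard deviation 5
print(standardize(85, 75, 5))  # 2.0 -> two standard deviations above the mean
```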
Excel Measures of Variance
Standard Deviation: is calculated with the STDEV() function. The Population Standard
Deviation is calculated with the function STDEV.P() and the Sample Standard Deviation is
calculated with the STDEV.S() function.
Variance: can be calculated with the VAR() function. The Population Variance is calculated with
the function VAR.P() and the Sample Variation is calculated with the VAR.S() function.
Coefficient of Variation (CV) – the ratio of the standard deviation to the mean, expressed as a
percentage. The coefficient of variation is used to measure variation relative to the mean.
▪ Measures relative variation.
▪ Always expressed as a percentage.
▪ Shows variation relative to the mean.
▪ Is used to compare two or more sets of data.
Skewness – a measure of how much data has strayed from a perfect bell curve distribution.
One can calculate skewness with the formula
skewness = [n / ((n − 1)(n − 2))] × Σ((xᵢ − x̄)/s)³
or with the Excel function SKEW().
If a skewness value is positive: the distribution is positively skewed (skewed to the right).
If a skewness value is negative: the distribution is negatively skewed (skewed to the left).
In the data analysis tool, you can select ‘descriptive statistics’. You must input your data, and
then at least select ‘summary statistics’.
Make sure to double check that the “count” figure is the same figure as the number of data
points you want to analyse. If it is wrong, then all the outputted figures will be wrong also.
Range: .ptp([data_array])
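The pandas and numpy equivalents can be sketched with made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(s.std())               # sample standard deviation (ddof=1), like STDEV.S()
print(s.var())               # sample variance, like VAR.S()
print(s.std(ddof=0))         # population standard deviation, like STDEV.P() -> 2.0
print(np.ptp(s.to_numpy()))  # range = max - min ("peak to peak") -> 7
```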
Probability – the chance that a particular event will occur. A probability value will be in a range
from 0-1 (0%-100%). If there is a probability of 1 (100%), then there is a certainty that an event
will occur. A probability of 0 (0%) represents there being no chance.
Experiment – a process that produces a single outcome whose result cannot be predicted with
certainty.
Sample Space – the collection of all outcomes that can result from a selection, decision, or
experiment.
Examples:
- All 6 faces of a die,
- All 52 cards of a deck of cards.
One can also restrict the outcomes, for example: no double ups, in which case the outcomes
will differ slightly.
Using a Tree Diagram to Define the Sample Space
Step 2: Define the outcomes for a single trial of the experiment: e.g. Yes, No
Step 3: Define the sample space for the number of trials using a tree diagram:
Venn Diagrams
Venn Euler diagrams can be used to express in graphical form the sample space and the event.
Step 2: List the outcomes associated with one trial of the experiment.
Step 4: Define the event of interest: a collection of the outcomes possible in the sample
space, often defined by a certain condition.
Types of Events
Mutually Exclusive Events – two events are mutually exclusive if the occurrence of one event
precludes the occurrence of the other event. If events have no outcomes in common, they are
said to be mutually exclusive.
Independent Events – two events are independent if the occurrence of one event in no way
influences the probability of the occurrence of the other event.
Dependent Events – two events are dependent if the occurrence of one event impacts the
probability of the other event occurring.
Assigning Probability
Classical Probability Assessment – the method of determining probability based on the ratio
of the number of ways an outcome or event of interest can occur to the number of ways any
outcome or event can occur when the individual outcomes are equally likely.
Think of a fair coin. There are two possible outcomes. The probability of landing heads is ½.
Relative Frequency Probability Assessment
Relative Frequency Probability Assessment – the method that defines probability as the
number of times an event occurs divided by the number of times the experiment is performed,
over a large number of trials.
This uses historical data.
Subjective Probability Assessment – the method that defines the probability of an event as
reflecting a decision maker’s state of mind regarding the chances that the particular event will
occur.
The subjective probability is a measure of a personal conviction that an event will occur,
representing a person’s belief that an event will occur.
Probability as Odds
Probabilities are often stated as odds (a ratio, e.g. 3:2) for or against a given event occurring.
Probability Rule 1:
The probability of an event occurring is always between 0 and 1.
Probability Rule 2:
The sum of the probabilities of all possible outcomes is 1.
Complement Rule
The complement of event E is the collection of all possible outcomes not contained in event E.
Complement – the probability of an event not occurring.
Random Variable – a variable that is subject to randomness, meaning it can take on different
values. It takes on different numerical values based on the chance of some event (e.g. events
from a random experiment).
Discrete Random Variable – can only assume a finite number of values or an infinite sequence
of values (e.g. 0, 1, 2, 3…). Whole number.
There could be many different outcomes (e.g., number of complaints per day) or only
two possible outcomes (e.g. defective item: yes or no).
One can calculate the mean by taking the sum of each variable multiplied by its respective
relative frequency.
Standard Deviation
For discrete variables, to find mean, variance and standard deviation, one can use the
rv_discrete function from the scipy.stats python library.
The scipy.stats library is quite large, so to be efficient, instead of importing the whole library,
one can just import the necessary function/s:
The following functions then can be used to calculate the mean, variance and standard
deviation:
- XXXX.mean()
- XXXX.var()
- XXXX.std()
You must define the value range for “XXXX/discvar/rv_discrete/whatever you use to define your
values”.
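A minimal sketch of that workflow; the values and probabilities below are made up for illustration:

```python
from scipy.stats import rv_discrete

# hypothetical discrete distribution: outcomes and their probabilities
values = (0, 1, 2, 3)
probabilities = (0.1, 0.3, 0.4, 0.2)
discvar = rv_discrete(values=(values, probabilities))

print(discvar.mean())  # expected value: sum of value * probability -> 1.7
print(discvar.var())   # variance -> 0.81
print(discvar.std())   # standard deviation -> 0.9
```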
Continuous Distributions – random variables that can take values at every point over a given
interval.
A normal probability distribution is asymptotic to the x-axis, and the amount of variation in the
random sample determines the height and spread of the normal distribution.
The probability for a range of values between A and B is defined as the area under the curve
between the two points.
Standard Normal Distribution – a normal distribution that has a mean = 0 and a standard
deviation = 1.
The horizontal axis is scaled in z-values that measure the number of standard deviations
a point is from the mean. Values above the mean have positive z-values and below the
mean have negative z-values.
Utilising the standardised normal z-value equation, one can convert any normal distribution
into a standard normal distribution.
Any normal distribution (with any mean and any standard deviation) can be scaled into the
standard normal distribution (z).
Any specified value, x, from the population distribution can be converted into a corresponding
z-value.
Excel has four different functions which can be utilised when dealing with normal distributions:
- NORM.S.DIST(z, cumulative)
normal standard distribution
cumulative = TRUE/FALSE
TRUE = returns the probability
FALSE = returns the point on the
distribution graph line
- NORM.S.INV(probability)
normal standard distribution inverse
Returns the z-value for the given probability.
- NORM.DIST(x, mean, std_Dev, cumulative)
normal distribution (WHEN: the z-value is unknown)
- NORM.INV(probability, mean, std_Dev)
Returns the x-value for the given probability.
Functions to deal with normal distributions can be found in the scipy.stats library. One can
import scipy.stats as st.
To find the x-value for any probability under a normal distribution, one can continue to use
functions from the norm module of the scipy.stats python library.
One can either import the entire library (import scipy.stats as st) or import just the norm module
(from scipy.stats import norm).
o norm.ppf(p, mean, sd) – Returns the x-value for which the probability of x being below
that value is P, for a normal distribution with the specified mean and standard deviation
(specify a mean of 0 and a standard deviation of 1 to present as a standard normal
distribution).
o norm.isf(p, mean, sd) - Returns the x-value for which the probability of x being above
that value is P, for a normal distribution with the specified mean and standard deviation
(specify a mean of 0 and a standard deviation of 1 to present as a standard normal
distribution).
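The four Excel functions map onto scipy.stats.norm as sketched below (the mean of 100 and standard deviation of 15 in the last line are made up):

```python
from scipy.stats import norm

print(norm.cdf(1.96))   # P(Z <= 1.96), like NORM.S.DIST(1.96, TRUE) -> ~0.975
print(norm.sf(1.96))    # P(Z > 1.96), the upper-tail "survival" probability
print(norm.ppf(0.975))  # z-value with 0.975 below it, like NORM.S.INV() -> ~1.96
print(norm.isf(0.025))  # z-value with 0.025 above it -> ~1.96

# non-standard normal: pass the mean and standard deviation directly
print(norm.ppf(0.975, loc=100, scale=15))  # x-value for a N(100, 15) population
```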
Exponential Probability Distribution – used to measure the time that elapses between two
occurrences of an event.
Week 7
What is Sampling?
Statistical Sampling
Statistical Sampling – where items of the sample are chosen based on known or calculable
probabilities.
Simple Random Sampling – where every possible sample of a given size has an equal
chance of being selected (can be selected using a table of random numbers).
Stratified Random Sampling – where one divides the population into subgroups (called
strata), according to some common characteristic, then selects a simple random
sample from each subgroup, then combines the samples from subgroups into one.
Systematic Random Sampling – where one decides the sample size, calculates the
number of subgroups of the population from that number, then randomly selects an
individual from the first group, then selecting the individual belonging to the same
location in every subsequent group.
Cluster Sampling – when one divides a population into several “clusters”, each
representative of the population, then selects a simple random sample of clusters to be
the overall sample.
Sampling Error: What it is and Why it Happens
Sampling Error – the difference between a measure computed from a sample (a statistic), and
the corresponding measure computed from the population (a parameter).
The size of a sampling error depends on the selected sample, and a sampling error
could occur as a negative or positive value.
To calculate the sampling error, one must calculate the population mean and the sample
mean. These are simple calculations:
To calculate the number of possible samples of a certain size, one must only follow this
simple equation: n! / (x!(n − x)!), where x is the sample size and n is the population size.
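Python's standard library computes this combination directly; a sketch with made-up sizes:

```python
import math

n = 10  # population size (made up)
x = 3   # sample size (made up)

print(math.comb(n, x))  # n! / (x! * (n - x)!) -> 120 possible samples
```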
Average Value of Sample Means
THEOREM 1:
For any population, the average value of all possible sample means, computed from all
possible random samples of a given size from the population, will equal the population
mean.
μx̄ = μ
Unbiased Estimator – a characteristic of certain statistics in which the average of
all possible values of the sample statistic equals the parameter.
THEOREM 2:
For any population, the standard deviation of the possible sample means, computed
from all possible random samples of size n, is equal to the population standard
deviation divided by the square root of the sample size (also called the standard error).
σx̄ = σ/√n
Sampling from the Normal Populations
THEOREM 3:
If a population is normally distributed (mean μ & standard deviation σ), the sampling
distribution of the sample mean (x̄) will also be normally distributed, with a mean equal
to the population mean (μx̄ = μ) and a standard deviation equal to the population
standard deviation divided by the square root of the sample size (σx̄ = σ/√n).
The Central Limit Theorem
THEOREM 4:
For simple random samples of n observations taken from a population with mean μ and
standard deviation σ, regardless of the population's distribution, provided the sample
size is sufficiently large, the distribution of the sample means, x̄, will be approximately
normal with a mean equal to the population mean (μx̄ = μ) and a standard deviation
equal to the population standard deviation divided by the square root of the sample
size (σx̄ = σ/√n).
The larger the sample size, the better the
approximation to the normal distribution.
The relative distance that a given sample mean is from the centre can be determined by
standardising the sampling distribution.
Point and Confidence Interval Estimates for a Population Mean
Point Estimate – a single statistic, determined from a sample, that is used to estimate the
corresponding population parameter.
Confidence Interval
Confidence Level – the percentage of all possible confidence intervals that will contain the true
population mean.
There are multiple scenarios and hence, methods, to determine the confidence level of a
certain confidence interval.
METHOD 1: Confidence interval for the population mean where population standard deviation
is known.
Case 1 – the simple random sample is drawn from a normal distribution.
Case 2 – the population does not have a normal distribution, or the distribution
of the population is unknown.
Both times, the sampling distribution for the sample mean is assumed to be normally
distributed.
The z-value (standardised value) for this calculation is known as the “critical value”.
95% of 𝑥 values will fall within ±1.96𝜎𝑥 of the population mean. Thus, 95% of the confidence
intervals will include the population mean.
Critical values (z-values (standardised values)) can be found utilising the standard normal table,
or the NORM.S.INV() function in Excel.
Margin of Error – the amount that is added or subtracted from the point
estimate to determine the endpoints of the confidence interval. Can be reduced
by increasing the sample size.
METHOD 2: Confidence interval for the population mean where population standard deviation
is unknown.
In most cases, if the population mean is unknown, so is the population standard deviation. In
such cases, more uncertainty is introduced, and the estimation process must be modified to
account for such.
When the population standard deviation is unknown, the critical value is a t-value, taken from a
family of distributions known as the Student’s t-Distributions.
Student’s t-Distributions – a family of distributions that are bell-shaped and symmetric like the
standard normal distribution but with greater area in the tails. Each distribution in the t-family is
defined by its degrees of freedom. As the degrees of freedom increase, the t-distribution
approaches the normal distribution.
Degrees of Freedom – the number of independent data values available to estimate the
population’s standard deviation. If k-parameters must be estimated before the
population’s standard deviation can be calculated from a sample size n, the degrees of
freedom are equal to 𝑛 − 𝑘 (k assumed to be 1).
The t-distribution is based on the assumption that the population is normally distributed.
However, as long as the population is reasonably symmetric, one can utilise t-distribution.
The mean is a fixed number that is either within the confidence interval or not. Hence, one must
word their statement differently than “probability”.
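A t-based confidence interval can be sketched with scipy.stats.t.interval; the sample mean, standard deviation and size below are made up for illustration:

```python
import math
from scipy import stats

# hypothetical sample: 16 observations, population std deviation unknown
sample_mean = 75.0
sample_std = 8.0
n = 16

# 95% confidence interval using the t-distribution with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, df=n - 1, loc=sample_mean,
                             scale=sample_std / math.sqrt(n))
print(low, high)
```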
Confidence Levels/Intervals in Excel
Example:
Excel offers functionality to return the Margin of Error value utilising the t-values or regular
normal distribution values. Where:
Syntax:
CONFIDENCE.NORM([alpha], [standard deviation], [size])
CONFIDENCE.T([alpha], [standard deviation], [size])
Where:
alpha = 1 − [confidence level]
standard deviation = relevant standard deviation (population vs sample)
size = sample size
Example:
Sample Size
On occasion, one may desire to determine the optimal sample size for a given confidence
level/margin of error. To do so, one must only rearrange the margin of error equation.
e = zσ/√n → n = (zσ/e)²
Example:
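The rearranged equation can be sketched in Python; the confidence level, margin of error and standard deviation below are made up for illustration:

```python
import math
from scipy.stats import norm

sigma = 10           # assumed population standard deviation
e = 2                # desired margin of error
z = norm.ppf(0.975)  # critical z-value for 95% confidence -> ~1.96

n = (z * sigma / e) ** 2
print(math.ceil(n))  # round up to whole observations -> 97
```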