
Final 2nd MAT1243 Handout 2023 Ac Year

Module code: MAT1243

Module Title: Descriptive Statistics and Probability

Academic Year: 2023


Audience: MPE, MCE, MBE, MCsE, MGE & MEE
Module Overview
➢Content is allocated in weeks from 19/10/2023 to 18/01/2024
➢The content is also divided into 3 main units with sections
under each unit:

Unit 1: Descriptive statistics; √ completely done

Unit 2: Bivariate Analysis; √ currently ongoing

Unit 3: Introduction to Probability.


Unit 2: Bivariate Analysis
Unit 2 Objectives
By the end of Unit 2, students will be able to:
➢ Understand the meaning of bivariate analysis in statistics (i.e. what bivariate analysis means and how it is used as a method in statistics);
➢ Use the types of correlation and the methods of studying correlation, including the scatter plot, covariance, and coefficients of correlation;
➢ Understand and use regression analysis as one technique for investigating the relationship between two quantitative variables.
2.1 Meaning of Bivariate Analysis
1st: What is bivariate analysis?
Bivariate analysis, or double series, refers to the analysis of two variables (dependent 𝒚 and independent 𝒙) to determine the relationship between them.
On one hand, the analysis explores how the dependent ("outcome") variable depends on, or is explained by, the independent ("explanatory") variable (i.e. asymmetrical analysis).
On the other hand, the analysis explores the association between two variables without any cause-and-effect relationship (i.e. symmetrical analysis).
2nd: What is bivariate analysis as a method in statistics?
This analysis is a helpful technique for determining how two variables are connected and for finding trends and patterns in the data. In fact, bivariate analysis aims to determine whether there is a statistical link between the two variables and, if so, how strong that link is and in which direction it runs.
Meaning of Bivariate Analysis, Cont’D;
• For example,
Relationship between:
- age and weight,
- weight and height,
- years of education and salary,
- amount of daily exercise and cholesterol level, etc.
Bivariate data are presented both graphically and numerically.
In both cases, the primary purpose is to determine whether or not there is a linear relationship between the two quantitative variables under consideration.
Meaning of Bivariate Analysis, Cont’D;
Key points in bivariate analysis:
• Bivariate data are observations on two variables, whilst univariate data are observations on only one variable.
• As in the 2nd meaning of bivariate analysis above, we normally collect bivariate data to investigate the relationship between the two variables and then use this relationship to inform future decisions.
• Finally, bivariate analysis, or simply bivariate statistics, deals with the collection, organization, analysis, interpretation, and drawing of conclusions from bivariate data.
Meaning of Bivariate Analysis, Cont’D;

The most used techniques for investigating the relationship between two quantitative variables are correlation and linear regression.
➢ Correlation quantifies the strength of the linear
relationship between a pair of variables,
➢ whereas regression expresses the relationship in
the form of an equation.
Details on next pages.
2.2 Correlation Analysis
Correlation is the study of association or degree of
relationship between the two variables under study.
Or simply, correlation is a statistical measure of the
relationship between the two variables’ relative
movements.

For example:
Relationship between price and supply, yield of crop
and fertilizer input, etc.
Correlation Analysis, Cont’D.
If two variables vary in such a way that movement in
one is accompanied by movement in the other, they are
said to be correlated.

When two variables are correlated, the points on a scatter plot fall along a line or curve (to be detailed later).
Concerning the quality of the correlation: the better the correlation, the closer the points lie to the line.
2.2.1 Recall on one Diagrammatic and Graphic Presentation
Scatter plot
• As seen in Unit 1, the scatter plot is a graph that presents the relationship between two variables in a data set. That is, it is a diagram that represents the relationship between two numeric measurements as data points on a Cartesian system.
• The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis. Such plots are called scatter plots, scatter graphs, scattergrams or scatter diagrams.
• A scatter plot is useful for a large set of data points, where each point is a pair of values and the data are in numerical form.
• The line drawn in a scatter plot that is near to almost all the points is known as the line of best fit or trend line.
Recall on Scatter plot, Cont’D;
Example 1
The following table contains data on the quantity of fertilizer used in 6 identical plots and the yield obtained from each of them. You are asked to make a scatter plot.

Quantity of fertilizer (Kg)    10   12    7   15   20   13
Yield (Kg)                    100  130   92  145  200  150

Answer 1
[Figure: Scatter plot of Fertilizer used and Yield; x-axis: Quantity of fertilizer (Kg), y-axis: Yield (Kg)]
Recall on Scatter plot, Cont’D;
Example 2:
Consider the following data which relate 𝑥, the respective number of
branches that 10 different banks have in a given common market,
with 𝑦, the corresponding market share of total deposits held by the
banks:

You are asked to present these data on scatter plot.


Recall on Scatter plot, Cont’D;
Answer 2: The scatter plot diagram of the above data is given by
Recall on Scatter plot, Cont’D;
Example 3:
Construct a scatter plot for the data obtained in a study on the
number of absences and the final grades of seven randomly selected
students from a statistics class. The data are shown here.
Recall on Scatter plot, Cont’D.
Answer 3:
Step 1 Draw and label the 𝑥 and 𝑦 axis.
Step 2 Plot each point on the graph
2.2.2 Types of correlation
Correlation can be linear (positive or negative), curvilinear, or absent (no correlation).
1st Linear:
If the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, then the correlation is said to be linear. It is discussed below:
➢ It is positive if both variables vary in the same direction, i.e. if one increases the other also increases, or if one decreases the other also decreases.
➢ It is negative if the variables vary in opposite directions, i.e. if one increases the other decreases.
➢ When only two variables are studied, it is known as simple correlation (zero-order correlation).
Types of correlation cont’d.
➢ When three or more variables are studied it is either partial or multiple correlations
depending upon number of restrictions put on the variables.

2nd Curvilinear:
If the change in one variable does not bear a constant ratio to the amount of
change in the other variable, then the correlation is said to be non-linear or
curvilinear correlation.
3rd No correlation:
When the points are scattered all over the graph and it is difficult to conclude whether the
values are increasing or decreasing then there is no correlation between the variables.
2.2.3 Linear correlation and no correlation in detail
Positive correlation
This happens when the points in the graph rise from left to right.
It means that the values of one variable increase as the values of the other increase.

Positive correlation; both data sets increase together (linear).

Negative correlation
This means that the values of one variable decrease as the values of the other increase.

Negative correlation; as one data set increases, the other decreases (linear).
No correlation
When the points are scattered all over the graph and it is difficult to conclude whether the values are increasing or decreasing, then there is no correlation between the variables.

No correlation; there is no relationship between the data (nonlinear).
Note: Correlation describes the type of relationship between two
data sets. The line of best fit is the line that comes closest to all the
points on a scatter plot.
One way to estimate the line of best fit is to lay a ruler’s edge over
the graph and adjust it until it looks closest to all the points.
2.2.4 Some Methods of studying Correlation
1st Method: Scatter Diagram:
• As explained above in the recall, the scatter diagram is a graph of the ordered pairs
(𝑥, 𝑦) of the independent variable 𝑥 and the dependent variable 𝑦.

• By looking to the scatter of various points one can form an idea whether the two
variables are correlated or not.

• The greater the scatter of points the lesser is the association / relationship between
two variables.

• The closer the points come to a straight line, the higher the correlation is said to be.
Methods of studying correlation: 1st. Scatter Diagrams, Cont’D.
Different examples for Scatter diagram to determine the types of correlation

i. Consider the following data which relate 𝑥, the respective number of branches that 10 different banks have in a given common market, with 𝑦, the corresponding market share of total deposits held by the banks:

If each point (𝑥, 𝑦) of the data is plotted in a plane, the scatter plot or
Scatter diagram is obtained.
The scatter plot or scatter diagram (in the figure above) indicates that,
roughly speaking, the market share increases as the number of
branches increases. We say that 𝑥 and 𝑦 have a positive correlation.
ii. On the other hand, consider the data below, which relate average daily temperature 𝑥, in degrees Fahrenheit, to daily natural gas consumption 𝑦, in cubic meters.
We see that 𝑦 tends to decrease as 𝑥 increases. Here, 𝑥 and 𝑦 have a negative correlation.
iii. Consider the data items (𝑥, 𝑦) below, which relate daily temperature
𝑥 over a 10-day period to the Dow Jones stock average 𝑦: (63, 3385);
(72, 3330); (76, 3325); (70, 3320); (71, 3330); (65, 3325); (70, 3280);
(74, 3280); (68, 3300); (61, 3265).

There is no apparent relationship between 𝑥 and 𝑦 (no correlation).
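As a numerical cross-check of example iii, the sketch below computes the Pearson correlation coefficient (introduced later in this unit as the 3rd method) for the ten temperature and Dow Jones pairs listed above. The function name `pearson_r` is illustrative, not part of the handout; the point is only that a value near zero agrees with the visual impression of no linear relationship.

```python
from math import sqrt

# Data items (x, y) from example iii: daily temperature and Dow Jones average.
data = [(63, 3385), (72, 3330), (76, 3325), (70, 3320), (71, 3330),
        (65, 3325), (70, 3280), (74, 3280), (68, 3300), (61, 3265)]


def pearson_r(pairs):
    """Pearson correlation computed from deviations about the two means."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
    s_yy = sum((y - mean_y) ** 2 for _, y in pairs)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(data)
print(round(r, 2))  # about -0.07, i.e. close to zero
```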
2nd Method: Covariance:
Covariance, as 2nd method of studying correlation, measures
the direction of the relationship between two variables.
• A positive covariance means that both variables tend to be
high or low at the same time. A negative covariance means
that when one variable is high, the other tends to be low.
• The covariance of variables 𝑥 and 𝑦, denoted 𝑪𝒐𝒗(𝒙, 𝒚) is a
measure of how these two variables change together.
• If the greater values of one variable mainly correspond with
the greater values of the other variable, and the same holds
for the smaller values, i.e. the variables tend to show similar
behavior, the covariance is positive.
Covariance Cont’D
• In the opposite case, when the greater values of one
variable mainly correspond to the smaller values of the
other, i.e. the variables tend to show opposite behavior,
the covariance is negative.
• If the covariance is zero, the variables are said to be uncorrelated; this means that there is no linear relationship between them.
➢Therefore, the sign of covariance shows the tendency in
the linear relationship between the variables. However,
the magnitude of covariance is not easy to interpret.
• The covariance is given by:
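A minimal sketch of the computation, assuming the population (divide-by-n) definition Cov(x, y) = (1/n) Σ (xᵢ − x̄)(yᵢ − ȳ). A sample version divides by n − 1 instead, and which convention the handout intends is an assumption here. The data are the fertilizer and yield values from Example 1 of the scatter plot recall; the function name `covariance` and its `sample` flag are illustrative.

```python
# Fertilizer (Kg) and yield (Kg) for the 6 plots of scatter plot Example 1.
fertilizer = [10, 12, 7, 15, 20, 13]
yield_kg = [100, 130, 92, 145, 200, 150]


def covariance(xs, ys, sample=False):
    """Average product of deviations; divides by n, or by n - 1 if sample=True."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


cov = covariance(fertilizer, yield_kg)
print(round(cov, 2))  # 140.69, positive: high fertilizer goes with high yield
```

The positive sign matches the upward trend of the scatter plot, while the magnitude (in Kg × Kg) has no direct interpretation, as noted above.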
Example 1:
Example 2: Find the covariance of the following distribution
3rd Method. Pearson’s Coefficient of Correlation

• Statisticians use a measure called the correlation coefficient to determine the strength of the linear relationship between two variables. There are several types of correlation coefficients. The one explained here is called the Pearson product moment correlation coefficient (PPMC), named after statistician Karl Pearson, who pioneered research in this area.
• The correlation coefficient computed from the sample data measures
the strength and direction of a linear relationship between two
variables. The symbol for the sample correlation coefficient is r. The
symbol for the population correlation coefficient is 𝝆 (Greek letter
rho).
• The range of the correlation coefficient is from -1 to +1. If there is a
strong positive linear relationship between the variables, the value
of r will be close to +1.
• If there is a strong negative linear relationship between the
variables, the value of r will be close to -1.
• When there is no linear relationship between the variables or only a
weak relationship, the value of r will be close to 0.

That is, the graph shows the range of the correlation coefficient: −1 ≤ 𝑟 ≤ 1


• Or, coefficient of correlation can also be given by:
Example 1:
Compute the correlation coefficient for the following data
Answer 1:
Substitute in the formula and solve for r.

The correlation coefficient suggests a strong relationship between the number of cars a rental agency has and its annual income.
Properties of the coefficient of correlation
a) The coefficient of correlation does not depend on the measurement scale. That is, whether height is expressed in meters or feet, the coefficient of correlation does not change.
b) The sign of the coefficient of correlation is the same as the sign of the covariance.
c) The square of the coefficient of correlation is equal to the product of the angular coefficients (slopes) of the two regression lines. In fact, $r^2 = b_{yx} \cdot b_{xy}$, where $b_{yx}$ is the slope of the regression line of 𝑦 on 𝑥 and $b_{xy}$ is the slope of the regression line of 𝑥 on 𝑦.
d) If the coefficient of correlation is known, it can be used to find the angular coefficients of the two regression lines.
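Properties (a) to (c) can be checked numerically. The sketch below uses the class-size and average-marks data of Example 2 in this section; the helper names and the rescaling factor of 100 are illustrative choices, not taken from the handout.

```python
from math import sqrt, isclose

x = [15, 20, 22, 25, 30]   # number of students
y = [17, 15, 13, 15, 12]   # average marks


def deviation_sums(xs, ys):
    """Return (S_xx, S_yy, S_xy), the sums of squared and cross deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return s_xx, s_yy, s_xy


def pearson_r(xs, ys):
    s_xx, s_yy, s_xy = deviation_sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(x, y)

# (a) Changing the unit of x (here multiplying by 100) leaves r unchanged.
r_rescaled = pearson_r([100 * value for value in x], y)

# (b) r has the same sign as the covariance (S_xy divided by n).
s_xx, s_yy, s_xy = deviation_sums(x, y)

# (c) r squared equals the product of the two regression slopes.
b_yx = s_xy / s_xx   # slope of y on x
b_xy = s_xy / s_yy   # slope of x on y

print(isclose(r, r_rescaled), (r < 0) == (s_xy < 0), isclose(r ** 2, b_yx * b_xy))
```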
Example 2:
From the following data on number of students in class and
average marks in five classes, find the correlation coefficient
and interpret its value

Number of students (𝑥)         15  20  22  25  30
Average marks over 20 (𝑦)     17  15  13  15  12
Answer 2: Set the table as follows:
No.      𝒙     𝒚     𝒙𝒚     𝒙𝟐     𝒚𝟐
1        15    17    255    225    289
2        20    15    300    400    225
3        22    13    286    484    169
4        25    15    375    625    225
5        30    12    360    900    144
Total   112    72   1576   2634   1052

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}$$
Answer Cont'D.
$$r = \frac{5(1576) - 112(72)}{\sqrt{\left(5(2634) - (112)^2\right)\left(5(1052) - (72)^2\right)}} = \frac{-184}{\sqrt{626 \times 76}} = \frac{-184}{218.12} \approx -0.844$$

This means that there is a strong negative linear relationship between the number of students in a class and the average marks.
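The table-based calculation of Answer 2 can be reproduced as follows. This is a small sketch that builds the column totals from the raw data and then applies the same sums formula; the variable names are illustrative.

```python
from math import sqrt

x = [15, 20, 22, 25, 30]   # number of students
y = [17, 15, 13, 15, 12]   # average marks over 20

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Same formula as in the answer: n*Sxy - Sx*Sy over the square root of the product.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 112 72 1576 2634 1052
print(numerator, round(r, 3))                # -184 -0.844
```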
4th Method: Spearman’s coefficient of rank correlation
• A Spearman coefficient of rank correlation or Spearman’s rho is a
measure of statistical dependence between two variables. It assesses
how well the relationship between two variables can be described
using a monotonic function. It is also known as a method of ranking.
• Spearman's coefficient of rank correlation is denoted and defined by:
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the ranks of the paired items in the two series and $n$ is the number of observations.
Example:
Calculate the Spearman’s coefficient of rank correlation for the series

Solution:
First order the values of 𝑥 and the values of 𝑦 from lowest to highest:
𝒙: 7; 8; 9; 10; 12; 12; 12; 12; 16; 16
𝒚: 4; 5; 6; 6; 7; 7; 8; 10; 10; 13
Then assign positions from 1 to 10 to the values in each data set (the second row of each pair gives the positions):
𝒙: 7; 8; 9; 10; 12; 12; 12; 12; 16; 16
   1; 2; 3; 4; 5; 6; 7; 8; 9; 10
𝒚: 4; 5; 6; 6; 7; 7; 8; 10; 10; 13
   1; 2; 3; 4; 5; 6; 7; 8; 9; 10
Where a value is repeated, its rank is the average of the positions it occupies.
E.g. the rank of 12 in the data of 𝑥 is given by $\frac{5+6+7+8}{4} = 6.5$.
This means that each 12 will be ranked 6.5.
Also, 16 appears 2 times, at positions 9 and 10, so the rank of 16 is given by $\frac{9+10}{2} = 9.5$, i.e. each 16 will be ranked 9.5.
Take $d^2 = d_i^2$, the squared difference between the two ranks of each pair.
• Therefore,
$$\rho = 1 - \frac{6\sum_{i=1}^{10} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 61}{10(100 - 1)} = \frac{990 - 366}{990}$$

• Therefore, 𝜌 = 0.63
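The tie-handling rule and the final substitution can be sketched in code. The helper names `average_ranks` and `spearman_rho` are illustrative; the sum of squared rank differences, 61, is taken from the worked solution rather than recomputed, since only the ordered values of each series are listed here.

```python
def average_ranks(values):
    """Rank from lowest to highest; tied values share the mean of their positions."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def spearman_rho(sum_d_squared, n):
    """Spearman's formula: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


x_values = [7, 8, 9, 10, 12, 12, 12, 12, 16, 16]
x_ranks = average_ranks(x_values)
print(x_ranks)
# [1.0, 2.0, 3.0, 4.0, 6.5, 6.5, 6.5, 6.5, 9.5, 9.5]

rho = spearman_rho(61, 10)
print(round(rho, 2))  # 0.63
```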

Example 2: Calculate the Spearman's coefficient of rank correlation for the series

Therefore,
$$\rho = 1 - \frac{6\sum_{i=1}^{6} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 3}{6(36 - 1)} = 1 - \frac{18}{210}$$

𝜌 = 0.91
2.3 Regression Analysis
▪ After having established that two variables are closely associated, one may be interested in estimating the value of one variable given the value of the other.
▪ Regression analysis reveals the average relationship between two variables and makes it possible to estimate or predict the value of the variable under study.
▪ In mathematics, Y is called a function of X; in statistics, this relationship is termed regression.
➢ Regression is the study of the functional relationship between two variables, of which one is dependent (Y) and the other is independent (X), using an equation.
• Examples: Yield of crop and quantity of fertilizer
Crop yield and quantity of rain (see Example 1 on the scatter plot)
2.3.1 Regression line and regression coefficient
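• Formulas for the regression line:
$$Y_{est} = a + b_{yx} X$$
where
$$a = \bar{Y} - b_{yx}\bar{X} \quad\text{and}\quad b_{yx} = \frac{\sum XY - \dfrac{\sum X \sum Y}{n}}{\sum X^2 - \dfrac{\left(\sum X\right)^2}{n}} \quad\text{or}\quad b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$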

• Interpretation:

➢ 𝒂 is also called the y-intercept. It is the value of y when x is zero.

➢ The slope 𝒃 is interpreted as follows: on average, a one-unit increase in X changes Y by 𝒃 units. If 𝒃 is positive, Y increases; if 𝒃 is negative, Y decreases.
Regression line and coefficient interpretation, Cont’d;
➢ If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line.
➢ The purpose of the regression line is to enable the researcher to see the trend and make predictions on the basis of the data.
➢ In addition, linear regression fits a straight line, 𝒀 = 𝒂𝒙 + 𝒃, to the data that gives the best prediction of y for any value of x. This is the line that minimises the distances between the data and the fitted line (the residuals), and it is known as the line of best fit.
➢ Given a scatter plot, you must be able to draw the line of best fit. Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.
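The "minimum sum of squares" idea can be illustrated with a small sketch, using the fertilizer and yield data from scatter plot Example 1. The names `intercept`, `slope` and `sum_squared_residuals` are illustrative, and the nudges applied to the coefficients are arbitrary; the check only shows that moving away from the least-squares values makes the sum larger.

```python
x = [10, 12, 7, 15, 20, 13]          # quantity of fertilizer (Kg)
y = [100, 130, 92, 145, 200, 150]    # yield (Kg)


def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept from deviations about the means.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

best = sum_squared_residuals(intercept, slope)

# Any other line, here obtained by nudging the coefficients, does worse.
nudged = [sum_squared_residuals(intercept + da, slope + db)
          for da in (-5, 0, 5) for db in (-1, 0, 1) if (da, db) != (0, 0)]

print(all(best < other for other in nudged))  # True
```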
Regression line and coefficient interpretation, Cont’d;
The reason you need a line of best fit is that the values of y will be
predicted from the values of x; hence, the closer the points are to the
line, the better the fit and the prediction will be.
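Example of best fit line:
[Figure: best fit line 𝑦 = 𝑎𝑥 + 𝑏 drawn through a scatter of points, with 𝑎 labelled as the slope, 𝑏 as the intercept, and ε as the residual error]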
Regression line and coefficient interpretation, Cont’d;
Note that:
• When r is positive, the line slopes upward and to the right.
• When r is negative, the line slopes downward from left to right.

More clarification on the determination of the regression line equation

• In algebra, the equation of a line is usually given as 𝑌 = 𝑚𝑥 + 𝑏, where 𝑚 is the slope of the line and 𝑏 is the y-intercept.
• In statistics, the equation of the simple linear regression line is written as 𝒀 = 𝒂𝒙 + 𝒃, where Y is the response (dependent) variable, X is the predictor (independent) variable, 𝒂 is the estimated slope, and 𝑏 is the estimated intercept. Note that in the form 𝒀𝒆𝒔𝒕 = 𝒂 + 𝒃𝒚𝒙 𝑿 of Section 2.3.1, which the worked examples use, the roles of the letters are swapped: 𝒂 is the intercept and 𝒃 is the slope.
Example 1.

The following data give the number of employees and the average daily income of five companies:

Number of employees (X)       6    7    4    5    8
Average daily income (Y)     90  110   50   80  100

(i) Calculate the correlation coefficient and interpret its value.
(ii) Obtain the regression equation of Y on X and interpret the slope coefficient.
(iii) Predict the average daily income for a company with 10 employees.
(iv) Calculate and interpret the coefficient of determination (correlation coefficient squared).
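Answer 1 (i) Correlation Coefficient
No       𝒙     𝒚     𝒙𝒚    𝒙𝟐     𝒚𝟐
1        6     90    540    36    8100
2        7    110    770    49   12100
3        4     50    200    16    2500
4        5     80    400    25    6400
5        8    100    800    64   10000
Total   30    430   2710   190   39100

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}$$
$$r = \frac{5(2710) - 30(430)}{\sqrt{\left(5(190) - 30^2\right)\left(5(39100) - 430^2\right)}} = \frac{650}{\sqrt{50 \times 10600}} \approx 0.89$$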
Answer 1 (ii) Regression line
No       𝒙     𝒚     𝒙𝒚    𝒙𝟐     𝒚𝟐
1        6     90    540    36    8100
2        7    110    770    49   12100
3        4     50    200    16    2500
4        5     80    400    25    6400
5        8    100    800    64   10000
Total   30    430   2710   190   39100

$$\bar{x} = \frac{30}{5} = 6, \qquad \bar{y} = \frac{430}{5} = 86$$
$$b = \frac{5(2710) - 30(430)}{5(190) - 30^2} = \frac{13550 - 12900}{950 - 900} = \frac{650}{50} = 13$$
$$a = 86 - 13(6) = 8$$
Answer 1 (ii) cont'd Regression line
The regression model is:
Average daily income (Y) = 8 + 13 × Number of employees (X)
➢ Interpretation of the slope: on average, one additional employee increases the average daily income of a company by 13 units.

Answer 1 (iii) Prediction of the average daily income for a company with 10 employees:
➢ For x = 10, y = 8 + 13 × 10 = 138
➢ A company with 10 employees is expected to have an average daily income of 138 units.
Answer 1 (iv) Coefficient of Determination (𝑹𝟐)

$$R^2 = r^2 = (0.89)^2 \approx 0.79 = 79\%$$

Interpretation:

About 79% of the variation in the daily income of the companies is explained by the number of employees, using the model.

Note that R-squared varies between 0 and 1, and a higher value indicates that the model fits the data well.
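Parts (i) to (iv) of Example 1 can be reproduced from the table totals with a short sketch. The variable and function names are illustrative. Note that without rounding r to 0.89 first, r² comes out near 0.80 rather than 0.79; the difference is only a rounding effect.

```python
from math import sqrt

# Totals from the Example 1 table (five companies).
n = 5
sum_x, sum_y = 30, 430
sum_xy, sum_x2, sum_y2 = 2710, 190, 39100

# (ii) Slope and intercept of the regression line of Y on X.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n


def predict(employees):
    """(iii) Predicted average daily income for a given number of employees."""
    return a + b * employees


# (i) Correlation coefficient and (iv) coefficient of determination.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_squared = r ** 2

print(b, a)                              # 13.0 8.0
print(predict(10))                       # 138.0
print(round(r, 2), round(r_squared, 2))  # 0.89 0.8
```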
2.3.2 Least-Squares Regression Line
The Least Squares Regression Line (LSRL):
The LSRL is the line that minimizes the sum of the squared residuals.

The equation of the LSRL is given by the formula $\hat{y} = a + bx$
where $\hat{y}$ is the predicted y-value given 𝑥,
$$b = r\,\frac{\delta_y}{\delta_x} \quad\text{and}\quad a = \bar{Y} - b\bar{X},$$
$\delta_x$ and $\delta_y$ are the standard deviations of the two variables, and r is their correlation. The value in red on the diagram, $y - \hat{y}$, is called the residual.
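As a check that this form agrees with the earlier sums formula, the sketch below applies b = r·δy/δx to the employees and income data of Example 1 and recovers the same line, Y = 8 + 13X. It assumes population standard deviations (`statistics.pstdev`); using sample standard deviations for both variables would give the same ratio. The variable names are illustrative.

```python
from statistics import mean, pstdev

x = [6, 7, 4, 5, 8]            # number of employees
y = [90, 110, 50, 80, 100]     # average daily income

n = len(x)
mean_x, mean_y = mean(x), mean(y)
delta_x, delta_y = pstdev(x), pstdev(y)

# Correlation as covariance divided by the product of standard deviations.
covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
r = covariance / (delta_x * delta_y)

# LSRL coefficients from the formulas above.
b = r * delta_y / delta_x
a = mean_y - b * mean_x

# Residuals y - y_hat for each observed point.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(round(b, 6), round(a, 6))          # 13.0 8.0
print([round(e, 6) for e in residuals])  # [4.0, 11.0, -10.0, 7.0, -12.0]
```

The residuals sum to zero, which is another way of seeing that the line passes through the point of means.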
Facts about Least-Squares Regression

• The slope b and the correlation r always have the same sign.
• If the correlation r is positive, the slope will be positive and the line
will move upward.
• If the correlation r is negative, the slope will be negative and the line
will move downward.
• The LSRL always passes through the point $(\bar{x}, \bar{y})$, where the mean of x and the mean of y intersect.
• The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x. This $r^2$ is also called the coefficient of determination.
Regression line can be also written as:

We may write:
Note that the regression line 𝑥 on 𝑦 is 𝑥 = 𝑐𝑦 + 𝑑 given by

This line is written as:


• To abbreviate the calculations, the two regression lines can be
determined as follows:
NB: One can also find the equation of a straight line 𝑦 = 𝑎𝑥 + 𝑏 using the least-squares method, when the normal equations are provided.
Example 2:
More exercises to come
2nd MAT1243 GROUP ASSIGNMENT (NON-SUBMITTABLE ONE)

Check also Annex 2 (for further Unit 2 exercises, to prepare you for next week's quiz on 23rd Nov 2023)
