R08 Multiple Regression and Machine Learning
Hypothesis testing: set up a one-tailed t-test, framing the null hypothesis as the statement we hope to reject.
Independent variable          ln (Number of market makers)   ln (Market capitalization of stocks)
Null hypothesis: H0           b1 ≥ 0                         b2 ≥ 0
Alternative hypothesis: Ha    b1 < 0                         b2 < 0
If we reject the null hypothesis in favor of the alternative hypothesis, the data support the
following conclusions:
• The greater the number of market makers, the smaller the percentage bid-ask spread.
• Stocks with higher market capitalization have more liquid markets, hence lower bid-ask
spreads and lower trading costs.
We will use a one-tailed t-test at the 0.01 level of significance, which gives a critical value of −2.345.
The t-values are given in the table: −18.7946 and −25.0093. If this information is not provided
in the table, we can also calculate the t-statistic using the formula:

t = (b̂j − bj) / sb̂j

For b1: t = (−1.5186 − 0) / 0.0808 = −18.7946
For b2: t = (−0.3790 − 0) / 0.0151 ≈ −25.0093
Decision: From the t-tests, we can reject the null hypothesis, since both t-values, −18.7946 and
−25.0093, are less than −2.345 (the critical value tc).
Conclusion: We can conclude that stocks with a large number of market makers and high
market capitalization have lower bid-ask spreads.
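As a cross-check, the hypothesis-test arithmetic can be reproduced in a few lines of Python. This is a minimal sketch using only the reported coefficient estimates, standard errors, and the stated critical value; small differences from the reported t-values are due to rounding of the inputs.

```python
# Minimal sketch: reproduce the one-tailed t-tests for b1 and b2.
# Inputs are taken from the regression output quoted above.

coefficients = {
    "ln(number of market makers)": (-1.5186, 0.0808),
    "ln(market capitalization)":   (-0.3790, 0.0151),
}
t_critical = -2.345   # one-tailed test at the 0.01 level of significance

for name, (b_hat, std_err) in coefficients.items():
    t_stat = (b_hat - 0) / std_err            # H0: coefficient >= 0
    print(f"{name}: t = {t_stat:.4f}, reject H0: {t_stat < t_critical}")
```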
The final section of the table tells us how well the model fits, or explains, the data. The multiple
R² of 0.6318 means that the model explains about 63% of the variation in the dependent variable.
Calculate the bid-ask spread for a stock with a market capitalization of 100 million and 20
market makers.
Solution:
The regression equation is given by:
Y = 1.5949 − 1.5186X1 − 0.3790X2
Y = 1.5949 − 1.5186 ln(20) − 0.3790 ln(100) = −4.6997
ln (bid-ask spread) = −4.6997
bid-ask spread = e^(−4.6997) = 0.0091 = 0.91%
The predicted value of the bid-ask spread is 0.91%.
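The same prediction can be scripted; a minimal sketch using the estimated coefficients from the regression output (Python standard library only):

```python
import math

# Minimal sketch: predicted bid-ask spread for a stock with 20 market makers
# and a market capitalization of 100 million, using
# ln(spread) = 1.5949 - 1.5186*ln(market makers) - 0.3790*ln(market cap).

b0, b1, b2 = 1.5949, -1.5186, -0.3790
ln_spread = b0 + b1 * math.log(20) + b2 * math.log(100)   # about -4.6997
spread = math.exp(ln_spread)                              # back out of log form
print(f"ln(spread) = {ln_spread:.4f}, spread = {spread:.4%}")  # roughly 0.91%
```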
2.1 Assumptions of the Multiple Linear Regression Model
Following are the six assumptions of a normal multiple linear regression model:
Let us go back to Example 1 and look at the ANOVA section of the regression output to see how
the values are calculated.
ANOVA        df       SS           MSS          F            Significance F
Regression   2        3,728.1334   1,864.0667   2,216.7505   0.00
Residual     2,584    2,172.8870   0.8409
Total        2,586    5,901.0204
Using the ANOVA output above, let us now calculate the values for RSS, SSE, MSR, and MSE.
The model has two slope coefficients, so there are two degrees of freedom in the regression
(the numerator of the F-test); the residual degrees of freedom are n − k − 1 = 2,584.
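The ANOVA figures can be reproduced from the sums of squares; a minimal sketch, assuming n = 2,587 observations so that the residual degrees of freedom equal 2,584:

```python
# Minimal sketch: rebuild the ANOVA quantities from the sums of squares.
# RSS and SSE are taken from the table above; k is the number of slope
# coefficients and n the number of observations.

k, n = 2, 2587
rss, sse = 3728.1334, 2172.8870    # regression and residual sums of squares
sst = rss + sse                    # total sum of squares
msr = rss / k                      # mean square regression
mse = sse / (n - k - 1)            # mean square error
f_stat = msr / mse                 # F-statistic for overall significance
r_squared = rss / sst
print(f"MSR = {msr:.4f}, MSE = {mse:.4f}, F = {f_stat:.2f}, R^2 = {r_squared:.4f}")
```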
If there are n states of the world, then use n - 1 dummy variables. For example, if an analyst
wants to distinguish between average market returns in January and the rest of the year,
then one dummy variable is used (January or not January).
Example: Month-of-the-year effects on small stock returns
This is based on example 5 from the curriculum.
This example tests for a seasonality effect in small-stock returns. Monthly returns data
from Jan. 2001 to Dec. 2013 are used. The regression uses 11 dummy variables (12
months − 1), one for each of the first 11 months of the year. The intercept
measures the return for December.
Returnst = b0 + b1 Jant + b2 Febt + … + b11 Novt + εt
where:
each monthly dummy variable has a value of 1 when the month occurs and a value of 0 in
other months.
Given the results of the regression, test whether the return of the small stock index differs
across months. Can we reject the null hypothesis at 5% significance?
             Coefficient   Standard Error   t-Statistic
Intercept     0.0273        0.0149            1.8322
January      -0.0213        0.0210           -1.0143
February     -0.0112        0.0210           -0.5333
March         0.0101        0.0210            0.4810
April        -0.0012        0.0210           -0.0571
May          -0.0425        0.0210           -2.0238
June         -0.0065        0.0210           -0.3095
July         -0.0481        0.0210           -2.2905
August       -0.0367        0.0210           -1.7476
September    -0.0285        0.0210           -1.3571
October      -0.0429        0.0210           -2.0429
November     -0.0339        0.0210           -1.6143
The steps to determine significance for a given regression output are listed below:
1. The null hypothesis is that the slope coefficients are zero, or that the returns are equal
across the months.
2. The R² is low (0.1174), indicating that the model explains only a small part of the variation
   in small-stock returns.
3. Refer to the F-table to determine the critical F-value at the 0.05 significance level. Degrees of
freedom in the numerator: df1 = 11 and degrees of freedom in the denominator: df2 =
144. From the table, the critical F-value is between 1.79 and 1.87. The F-statistic is given
as 1.7421, which is smaller than the critical F value. We cannot reject the null hypothesis
that all the slope coefficients are equal to zero.
4. The p-value of 6.98 percent is higher than 5 percent. This means the smallest level of
   significance at which the null hypothesis can be rejected is roughly 7 percent.
5. Looking at the individual t-statistics, the coefficients for May, July, and October appear
   significant on their own (their absolute t-values exceed the critical t-value), while the
   coefficients for the remaining months are insignificant. The joint F-test, however, is the
   relevant test here, and based on it we cannot reject the null hypothesis that the returns
   are equal across the months.
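For reference, such a dummy-variable regression can be estimated with pandas and statsmodels. This is a minimal sketch; the `returns` series and the `month_dummy_regression` helper are illustrative assumptions, not part of the curriculum example.

```python
import pandas as pd
import statsmodels.api as sm

# Minimal sketch: regress monthly small-stock returns on 11 month dummies
# (December is the omitted month, so the intercept measures December returns).
# 'returns' is assumed to be a pandas Series of monthly returns with a
# DatetimeIndex covering Jan. 2001 to Dec. 2013.

def month_dummy_regression(returns: pd.Series):
    months = returns.index.month
    dummies = pd.get_dummies(months, prefix="m")       # m_1 ... m_12
    dummies = dummies.drop(columns=["m_12"])           # drop December
    dummies.index = returns.index
    X = sm.add_constant(dummies.astype(float))
    return sm.OLS(returns, X).fit()

# model = month_dummy_regression(returns)
# print(model.summary())   # F-statistic, p-value, and monthly coefficients
```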
In this model, the variance of the regression residuals (error terms) increases as the value of the
independent variable increases; that is, the error variance depends on the values of the
independent variable. This is conditional heteroskedasticity.
The Consequences of Heteroskedasticity
• The F-test for the overall significance of the regression becomes unreliable.
• The regression coefficient estimates are unaffected, but the standard errors are understated.
• Understated standard errors make the t-tests for the significance of individual regression
  coefficients unreliable, because the t-statistics are overstated. This can lead to finding
  significant relationships where none exist (Type I errors).
How to Test for Heteroskedasticity
The Breusch-Pagan test is widely used to detect heteroskedasticity. It tests the null
hypothesis that the regression’s squared error term is uncorrelated with the independent
variables (no conditional heteroskedasticity). The alternative hypothesis states that the
error term is correlated with the independent variables.
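A minimal sketch of the Breusch-Pagan test using statsmodels; the simulated data, whose error variance grows with x, are purely illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Minimal sketch: fit OLS on simulated data with error variance that depends
# on x, then run the Breusch-Pagan test.
# H0: squared residuals are unrelated to the independent variables.

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)     # error variance grows with x
results = sm.OLS(y, sm.add_constant(x)).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid,
                                                        results.model.exog)
print(f"Breusch-Pagan LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")
# A small p-value rejects H0 and signals conditional heteroskedasticity.
```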
How to Correct for Heteroskedasticity
Two methods are used to correct for the effects of heteroskedasticity (a code sketch of the first method follows the list below):
• Robust standard errors: Corrects the standard errors of the linear regression
model’s estimated coefficients to account for conditional heteroskedasticity.
• Generalized least squares: Modifies the original equation in an attempt to eliminate
heteroskedasticity.
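A minimal sketch of heteroskedasticity-robust standard errors with statsmodels on simulated data; the "HC1" estimator is one common choice of robust covariance.

```python
import numpy as np
import statsmodels.api as sm

# Minimal sketch: compare ordinary and heteroskedasticity-robust (White/HC)
# standard errors for the same simulated regression.

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=500)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)
X = sm.add_constant(x)

ols_fit    = sm.OLS(y, X).fit()                   # conventional standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")     # robust standard errors

print("OLS standard errors:   ", ols_fit.bse.round(4))
print("Robust standard errors:", robust_fit.bse.round(4))
# The coefficient estimates are identical; only the standard errors change.
```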
4.2 Serial Correlation
Serial correlation is also called autocorrelation. In serial correlation, regression errors in
one period are correlated with errors from previous periods. It is often found in time-series
regressions. The assumption that the regression errors are uncorrelated across observations
is violated here.
The Consequences of Serial Correlation
• As with heteroskedasticity, the regression coefficient estimates are unaffected, but the
  standard errors are understated.
• The t-statistics and F-statistic are too high as a result, which can lead to incorrectly
  rejecting the null hypothesis (a Type I error).
How to Test for Serial Correlation
A common method for testing serial correlation is the Durbin-Watson (DW) statistic. The
formula for DW statistic is:
DW ≈ 2(1 - r)
where:
DW = Durbin-Watson statistic
r = sample correlation between the regression residuals from one period and those from the
previous period
We will not go into the details of the formula, but just a few important points related to how
to interpret the statistic are listed below:
• The DW statistic takes values between 0 and 4 (0 ≤ DW ≤ 4).
• No serial correlation: DW ≈ 2
• Positive serial correlation: DW < 2
• Negative serial correlation: DW > 2
• When serial correlation = +1, DW = 0; when serial correlation = −1, DW = 4
It is not possible to determine a single critical value for the Durbin-Watson statistic, but we
can determine if it lies between two values, dl (lower value) and du (upper value), or outside
those values.
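A minimal sketch computing the Durbin-Watson statistic with statsmodels; the AR(1) error simulation is illustrative, and the induced positive serial correlation should push DW well below 2.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Minimal sketch: Durbin-Watson statistic for a regression whose errors follow
# an AR(1) process, i.e. errors in one period are correlated with prior errors.

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
errors = np.zeros(n)
for t in range(1, n):
    errors[t] = 0.7 * errors[t - 1] + rng.normal(scale=0.5)   # serially correlated
y = 1.0 + 2.0 * x + errors

results = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(results.resid)
print(f"Durbin-Watson statistic = {dw:.3f}")   # expect a value well below 2
```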
For example, the scatter plot below shows there is no correlation between the
variables X and Y in each cluster. For the combined data, however, there is a high
correlation between X and Y which is misleading. This is because the data is drawn
from two different samples that should not have been pooled.
In the bankruptcy example, the dependent variable is a discrete outcome: 0 or 1. It is highly
unlikely that a linear regression will produce such a discrete outcome; the predicted value could
be greater than 1 or less than 0. Since a probability cannot be greater than 100 percent or less
than 0 percent, a model designed for discrete outcomes, such as a probit or logit model, is the
appropriate choice.
Probit and logit models estimate the probability of a discrete outcome given the values of
the independent variables used to explain that outcome.
The probit model, which is based on the normal distribution, estimates the probability that Y
= 1 (a condition is fulfilled) given the value of the independent variable X.
The logit model is identical, except that it is based on the logistic distribution rather than a
normal distribution.
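A minimal sketch of logit and probit estimation with statsmodels on simulated bankruptcy-style data; the variable names and the single predictor are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

# Minimal sketch: logit and probit models for a binary (0/1) outcome,
# e.g. bankrupt vs. not bankrupt, on simulated data.

rng = np.random.default_rng(3)
n = 1000
leverage = rng.normal(size=n)                        # illustrative predictor
prob = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * leverage)))
bankrupt = rng.binomial(1, prob)                     # binary outcome

X = sm.add_constant(leverage)
logit_fit  = sm.Logit(bankrupt, X).fit(disp=0)       # logistic distribution
probit_fit = sm.Probit(bankrupt, X).fit(disp=0)      # standard normal distribution

# Both models return probabilities between 0 and 1, unlike linear regression.
print("Logit predicted probabilities: ", logit_fit.predict(X)[:3].round(3))
print("Probit predicted probabilities:", probit_fit.predict(X)[:3].round(3))
```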
Discriminant analysis yields a linear function that is used to create an overall score. This
score is then used to classify an observation into the bankrupt or non-bankrupt category.
7. Machine Learning
So far we have focused on multiple linear regression, where we work with relatively small
and well-organized data sets. However, if we are working with ‘Big Data’, these techniques may
not be sufficient. Big Data has the following characteristics:
• Volume: There is an immense volume of data compared to traditional data sets.
• Variety: The data may be structured or unstructured; we may have text- or video-based data.
• Velocity: We may have to analyze data on a real-time basis.
To analyze Big Data we have to use machine learning. An important concept under machine
learning is ‘data analytics’ where we extract information from Big Data, which gives us
actionable insights.
7.1 Major Focuses of Data Analytics
There are six major focuses of data analytics:
1. Measuring correlations: This focuses on understanding the contemporaneous
relationships between variables. Contemporaneous means that we are working with
variables at the same point in time. For example, we might be interested in measuring
the correlation between Microsoft returns and the returns of the S&P 500 index.
2. Making predictions: Here we try to determine if one variable (X) can be used to
predict another variable (Y). For example, we might measure the number of social
media posts related to Apple and try to predict the movement in Apple’s stock price
based on this information.
3. Making causal inferences: Here we try to determine if a change in an independent
variable (∆X) causes a change in a dependent variable (∆Y). A causal relationship
implies that there is some underlying mechanism that connects the dependent
variable with the independent variable. A causal relationship is a stronger
relationship as compared to correlations and predictions.
4. Classifying data: This focuses on sorting data into distinct categories. For example,
we may consider millions of credit card transactions and try to determine whether a
transaction is fraudulent or not. Here, our classification model is binary. It is also
possible to perform multi-category classification, for example, when assigning credit
ratings.
5. Sorting data into clusters: This involves grouping observations into clusters such that
   observations within the same cluster are similar to each other and different from
   observations in other clusters. The basis for creating these clusters may or may not
be specified in advance. For example, an equity analyst may divide a group of stocks
into clusters based on the risk-return characteristics of the stocks rather than
traditional classification systems such as sectors or industries.
6. Reducing the dimension of data: This involves reducing the number of independent
   variables (X) while retaining the predictive power of the model. For example, let’s
   say we initially have a model with 10 independent variables (X1, X2, … X10) that predict
   a dependent variable (Y). Assume that through a dimension-reduction exercise, we
   determine that just three independent variables (X1, X2, X3) predict Y just as well.
   Dimension reduction is important because it helps us identify the variables that have
   the strongest relationship with the dependent variable. Also, it has been shown that
   simpler models with fewer independent variables work better than complex models
   for out-of-sample forecasting. A code sketch of one dimension-reduction technique
   follows this list.
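The sketch referred to above uses principal component analysis (PCA), one common dimension-reduction technique, via scikit-learn; the simulated data set of ten correlated variables is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Minimal sketch: dimension reduction with PCA. Ten correlated independent
# variables are simulated from three underlying drivers, so a few components
# capture most of their variation.

rng = np.random.default_rng(4)
common = rng.normal(size=(500, 3))                          # 3 underlying drivers
loadings = rng.normal(size=(3, 10))
X = common @ loadings + 0.1 * rng.normal(size=(500, 10))    # 10 observed variables

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)                            # 500 x 3 instead of 500 x 10
print("Variance explained by 3 components:",
      pca.explained_variance_ratio_.sum().round(3))
```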
Historically, the six techniques mentioned above have been used with structured data. But in
recent times with the developments in machine learning, these techniques are also being
used with Big Data.
7.2 What Is Machine Learning?
Machine learning (ML) is a subset of artificial intelligence (AI). In ML, machines display
intelligent decision-making abilities through activities such as sensing, reasoning, and
understanding and speaking languages. For example, Apple’s digital assistant Siri can be
considered a machine learning program. In machine learning, machines are programmed to
improve performance in specified tasks with experience. A formal definition of machine
learning is provided below:
‘A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P if its performance at tasks in T, as measured by P, improves
with experience E.’
We can understand this definition through a simple example. Let’s say we want to apply
machine learning to determine whether a credit card transaction is fraudulent or not. Here
the terms can be defined as:
• T (task): Predict whether credit card transactions are fraudulent or not
• P (performance measure): Percentage of transactions correctly predicted
• E (experience): Number of credit card transactions that the program works with.
Let’s say that for the first 1 million transactions the performance measure was 80%, which
means that while predicting whether a credit card transaction was fraudulent or not, our
program was 80% accurate.
However, for the next 1 million transactions, when the experience level has reached a total of
2 million transactions, we observe that the performance has improved to 90%. This means
that the program has learnt through experience and improved its performance. This is an
example of machine learning.
7.3 Types of Machine Learning
Broadly speaking, we have two types of machine learning: supervised learning and
unsupervised learning.
Supervised learning is machine learning that makes use of labelled training data. A formal
definition is: ‘Supervised learning is the process of training an algorithm to take a set of
inputs X and find a model that best relates them to the output Y.’
In the credit card example, the ML program is given training data, which may consist of
several hundred transactions of different amounts, different origins etc. These transactions
are labelled as ‘fraudulent’ or ‘not fraudulent’. The ML program learns from this labelled
training data and can predict whether new transactions are fraudulent.
Supervised learning is useful for classifications and predictions.
Unsupervised learning is machine learning that does not make use of labeled training data.
Several input variables are used for analysis but no output (or target variable) is provided.
Because tagged data is not provided, the ML program has to discover structure within the
data on its own. An example of where unsupervised learning is applied is clustering. For
example, a program may be given financial information about companies and the program
can cluster companies into groups based on their attributes.
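A minimal sketch of such clustering using k-means (one common clustering algorithm) in scikit-learn; the simulated risk-return features are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal sketch: unsupervised clustering with k-means. Each row could represent
# a stock described by risk-return features (the data here are simulated); no
# labels are supplied, so the algorithm discovers the groups on its own.

rng = np.random.default_rng(5)
low_risk  = rng.normal(loc=[0.05, 0.10], scale=0.02, size=(50, 2))
high_risk = rng.normal(loc=[0.12, 0.30], scale=0.02, size=(50, 2))
features = np.vstack([low_risk, high_risk])            # columns: return, volatility

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print("Cluster sizes:", np.bincount(kmeans.labels_))   # two clusters of ~50 stocks
```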
Unsupervised learning is also used for dimension reduction where we want to reduce the
number of independent variables in a model.
Machine learning vocabulary: In multiple regression, we call the X variables the independent
variables and the Y variable the dependent variable. In machine learning, the Y variable is also
called the ‘tag’ variable or the ‘target’ variable, and the X variables are called ‘features’.
Supervised learning
Supervised learning can be divided into two categories: regression and classification. In
regression the target variable (Y) is continuous. In classification, the target variable (Y) is
either categorical (categories cannot be ranked) or ordinal (categories can be ranked, for
example, credit ratings).
Let us now look at the specific types of supervised learning algorithms.
Penalized Regression:
• It is a computationally efficient technique used in prediction problems.
• It is a special case of the generalized linear model (GLM), in which we can specify how the
  model is calibrated and how parsimonious it is (i.e., how few X variables it uses).
• Penalized regression addresses the overfitting problem through a process called
  regularization. Overfitting occurs when a model fits the training data too closely (for
  example, through excessive data mining) and therefore does not work well with
  out-of-sample data.
• The regression coefficients are chosen to minimize the sum of squared residuals plus a
  penalty term that increases with the number of included variables with non-zero
  coefficients. Because of this penalty, the model remains parsimonious, and only the
  variables most important for explaining Y remain in the model.
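A minimal sketch of penalized regression using LASSO, one widely used form, via scikit-learn; the simulated data, in which only three of ten features matter, are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Minimal sketch: LASSO, a common form of penalized regression. The penalty
# pushes the coefficients of unimportant variables to (or near) zero, keeping
# the model parsimonious. Only 3 of the 10 simulated features actually matter.

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)        # alpha controls the penalty strength

print("OLS coefficients:  ", ols.coef_.round(2))
print("LASSO coefficients:", lasso.coef_.round(2))   # irrelevant features near/at zero
```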
Classification and Regression Trees (CART)
• It is a computationally efficient technique that is adaptable to datasets with complex
structures.
• CART can be applied to predict either a categorical or a continuous target variable.
▪ If we are predicting a categorical target variable, then a classification tree is
produced.
▪ Whereas, if we are predicting a continuous outcome, then a regression tree is
produced.
• It is commonly applied to classification problems where the target is binary, for
  example, classifying credit card transactions as fraudulent or not fraudulent.
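A minimal sketch of a regression tree (continuous target) with scikit-learn; for a categorical target a classification tree (DecisionTreeClassifier) would be fitted instead. The data are simulated for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Minimal sketch: a regression tree (CART with a continuous target).

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.where(X[:, 0] > 0, 2.0, -1.0) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)   # shallow tree to limit overfitting
print("Predictions for two new observations:",
      tree.predict([[1.0, 0.0], [-1.0, 0.0]]).round(2))
```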
Random Forests
• A random forest classifier is a collection of classification trees.
• Instead of just one classification tree, several classification trees are built based on
random selection of features (variables).
• Random forests protect against overfitting on the training data.
• Random forests are an example of ensemble learning, whereby signal-to-noise ratio is
improved, because errors cancel each other out across the collection of classification
trees.
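A minimal sketch comparing a single classification tree with a random forest on the same simulated binary classification problem, illustrating the out-of-sample benefit of the ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Minimal sketch: a random forest (an ensemble of classification trees) versus
# a single tree on a simulated binary problem, e.g. fraudulent vs. not fraudulent.

rng = np.random.default_rng(8)
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("Single tree out-of-sample accuracy:  ", round(tree.score(X_test, y_test), 3))
print("Random forest out-of-sample accuracy:", round(forest.score(X_test, y_test), 3))
# The forest typically generalizes better because errors of individual trees offset.
```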
Neural Networks (Artificial Neural Networks)
• Neural networks are applied to tasks characterized by non-linearities and
interactions among variables.
• Neural networks have layers of nodes connected by links:
▪ Input-layer nodes correspond to the features used for prediction
▪ One or more hidden layers of nodes feed an output node
▪ The output node generates the predicted value
The following figure illustrates a neural network.
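In addition to the figure, a small feed-forward network can be sketched in code with scikit-learn's MLPClassifier; the choice of library and the simulated data are assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Minimal sketch: a small feed-forward neural network with one input layer
# (4 features), two hidden layers of 8 nodes each, and one output node.

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 4))
y = (np.sin(X[:, 0]) + X[:, 1] * X[:, 2] > 0).astype(int)   # non-linear relationship

net = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0)
net.fit(X, y)
print("In-sample accuracy:", round(net.score(X, y), 3))
```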
Summary
LO:a. Formulate a multiple regression equation to describe the relation between a
dependent variable and several independent variables and determine the statistical
significance of each independent variable.
A multiple regression allows us to determine the effect of more than one independent
variable on a particular dependent variable.
A multiple regression model is given by:
Yi = b0 + b1X1i + b2X2i + … + bkXki + εi,  i = 1, 2, …, n
LO:b. Interpret estimated regression coefficients and their p-values.
The slope coefficient bj measures how much the dependent variable Y changes when the
independent variable, Xj, changes by one unit holding all other independent variables
constant.
The lower the p-value for a test, the more significant the result.
LO:c. Formulate a null and an alternative hypothesis about the population value of a
regression coefficient, calculate the value of the test statistic, and determine whether
to reject the null hypothesis at a given level of significance.
H0: bj = 0
Ha: bj ≠ 0
The t-test, t = (b̂j − bj) / sb̂j, is used to determine the statistical significance of the population
value of a regression coefficient.
LO:d. Interpret the results of hypothesis tests of regression coefficients.
If the t-statistic is greater than the upper critical t-value (or less than the lower critical t-value),
then we can reject the null hypothesis and conclude that the regression coefficient is
statistically significant.
LO:e. Calculate and interpret 1) a confidence interval for the population value of a
regression coefficient and 2) a predicted value for the dependent variable, given an
estimated regression model and assumed values for the independent variables.
Confidence Interval = b̂j ± (critical t-value) × (SE of the coefficient)
Predicted value: Ŷi = b̂0 + b̂1X̂1i + b̂2X̂2i + … + b̂kX̂ki
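A minimal sketch of the confidence-interval calculation using scipy, with the coefficient estimate and standard error from the bid-ask spread example as illustrative inputs:

```python
from scipy import stats

# Minimal sketch: 95% confidence interval for a slope coefficient, using the
# estimate, its standard error, and the residual degrees of freedom (n - k - 1).

b_hat, std_err, df = -1.5186, 0.0808, 2584
t_crit = stats.t.ppf(0.975, df)                  # two-tailed 5% critical value
lower, upper = b_hat - t_crit * std_err, b_hat + t_crit * std_err
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```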