R08 Multiple Regression and Machine Learning
Hypothesis testing: set up a one-tailed t-test, framing the null hypothesis as the statement we hope to reject.
Independent variable          ln (Number of market makers)   ln (Market capitalization of stocks)
Null hypothesis: H0           b1 ≥ 0                         b2 ≥ 0
Alternative hypothesis: Ha    b1 < 0                         b2 < 0
If we reject the null hypothesis in favor of the alternative hypothesis, the data support the
following conclusions:
• The greater the number of market makers, the smaller the percentage bid-ask spread.
• Stocks with higher market capitalization have more liquid markets, hence lower bid-ask
spreads and lower trading costs.
We will use a one-tailed t-test at the 0.01 level of significance, which gives a critical value of −2.345.
The t-values are given in the table: −18.7946 and −25.0093. If this information is not provided
in the table, we can also calculate the t-statistic using the formula:

t = (b̂j − bj) / sb̂j

For b1: t = (−1.5186 − 0) / 0.0808 = −18.7946
For b2: t = (−0.3790 − 0) / 0.0151 ≈ −25.0093
Decision: From the t-tests, we can reject the null hypothesis, since both t-values, −18.7946 and
−25.0093, are less than −2.345 (the critical value tc).
Conclusion: We can conclude that stocks with a large number of market makers and high
market capitalization have lower bid-ask spreads.
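As a cross-check, the hypothesis-test arithmetic can be reproduced in a few lines of Python. This is a minimal sketch using only the reported coefficient estimates, standard errors, and the stated critical value; small differences from the reported t-values are due to rounding of the inputs.

```python
# Minimal sketch: reproduce the one-tailed t-tests for b1 and b2.
# Inputs are taken from the regression output quoted above.

coefficients = {
    "ln(number of market makers)": (-1.5186, 0.0808),
    "ln(market capitalization)":   (-0.3790, 0.0151),
}
t_critical = -2.345   # one-tailed test at the 0.01 level of significance

for name, (b_hat, std_err) in coefficients.items():
    t_stat = (b_hat - 0) / std_err            # H0: coefficient >= 0
    print(f"{name}: t = {t_stat:.4f}, reject H0: {t_stat < t_critical}")
```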
The final section of the table tells us how well the model fits, or explains, the data. The multiple
R² of 0.6318 means that the model explains about 63% of the variation in the dependent variable.
Calculate the bid-ask spread for a stock with a market capitalization of 100 million and 20
market makers.
Solution:
The regression equation is given by:
Y = 1.5949 − 1.5186X1 − 0.3790X2
Y = 1.5949 − 1.5186 ln(20) − 0.3790 ln(100) = −4.6997
ln (bid-ask spread) = −4.6997
bid-ask spread = e^(−4.6997) = 0.0091 = 0.91%
The predicted value of the bid-ask spread is 0.91%.
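The same prediction can be scripted; a minimal sketch using the estimated coefficients from the regression output (Python standard library only):

```python
import math

# Minimal sketch: predicted bid-ask spread for a stock with 20 market makers
# and a market capitalization of 100 million, using
# ln(spread) = 1.5949 - 1.5186*ln(market makers) - 0.3790*ln(market cap).

b0, b1, b2 = 1.5949, -1.5186, -0.3790
ln_spread = b0 + b1 * math.log(20) + b2 * math.log(100)   # about -4.6997
spread = math.exp(ln_spread)                              # back out of log form
print(f"ln(spread) = {ln_spread:.4f}, spread = {spread:.4%}")  # roughly 0.91%
```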
2.1 Assumptions of the Multiple Linear Regression Model
Following are the six assumptions of a normal multiple linear regression model:
Let us go back to Example 1 and look at the ANOVA section of the regression output to see how
the values are calculated.
ANOVA        df       SS           MSS          F            Significance F
Regression   2        3,728.1334   1,864.0667   2,216.7505   0.00
Residual     2,584    2,172.8870   0.8409
Total        2,586    5,901.0204
Using the ANOVA output above, let us now calculate the values for RSS, SSE, MSR, and MSE.
The model has two slope coefficients, so there are two degrees of freedom in the regression
(the numerator of the F-test); the residual degrees of freedom are n − k − 1 = 2,584.
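The ANOVA figures can be reproduced from the sums of squares; a minimal sketch, assuming n = 2,587 observations so that the residual degrees of freedom equal 2,584:

```python
# Minimal sketch: rebuild the ANOVA quantities from the sums of squares.
# RSS and SSE are taken from the table above; k is the number of slope
# coefficients and n the number of observations.

k, n = 2, 2587
rss, sse = 3728.1334, 2172.8870    # regression and residual sums of squares
sst = rss + sse                    # total sum of squares
msr = rss / k                      # mean square regression
mse = sse / (n - k - 1)            # mean square error
f_stat = msr / mse                 # F-statistic for overall significance
r_squared = rss / sst
print(f"MSR = {msr:.4f}, MSE = {mse:.4f}, F = {f_stat:.2f}, R^2 = {r_squared:.4f}")
```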
If there are n states of the world, then use n - 1 dummy variables. For example, if an analyst
wants to distinguish between average market returns in January and the rest of the year,
then one dummy variable is used (January or not January).
Example: Month-of-the-year effects on small stock returns
This is based on example 5 from the curriculum.
This example tests for a seasonality effect in small-stock returns. Monthly returns data
from Jan. 2001 to Dec. 2013 are used. The regression uses 11 dummy variables (12
months − 1), one for each of the first 11 months of the year. The intercept
measures the return for December.
Returnst = b0 + b1 Jant + b2 Febt + … + b11 Novt + εt
where:
each monthly dummy variable has a value of 1 when the month occurs and a value of 0 in
other months.
Given the results of the regression, test whether the return of the small stock index differs
across months. Can we reject the null hypothesis at 5% significance?
             Coefficient   Standard Error   t-Statistic
Intercept     0.0273        0.0149            1.8322
January      -0.0213        0.0210           -1.0143
February     -0.0112        0.0210           -0.5333
March         0.0101        0.0210            0.4810
April        -0.0012        0.0210           -0.0571
May          -0.0425        0.0210           -2.0238
June         -0.0065        0.0210           -0.3095
July         -0.0481        0.0210           -2.2905
August       -0.0367        0.0210           -1.7476
September    -0.0285        0.0210           -1.3571
October      -0.0429        0.0210           -2.0429
November     -0.0339        0.0210           -1.6143
The steps to determine significance for a given regression output are listed below:
1. The null hypothesis is that the slope coefficients are zero, or that the returns are equal
across the months.
2. The R² is low (0.1174), indicating that the model explains only a small part of the variation
   in small-stock returns.
3. Refer to the F-table to determine the critical F-value at the 0.05 significance level. Degrees of
freedom in the numerator: df1 = 11 and degrees of freedom in the denominator: df2 =
144. From the table, the critical F-value is between 1.79 and 1.87. The F-statistic is given
as 1.7421, which is smaller than the critical F value. We cannot reject the null hypothesis
that all the slope coefficients are equal to zero.
4. The p-value of 6.98 percent is higher than 5 percent. This means the smallest level of
   significance at which the null hypothesis can be rejected is roughly 7 percent.
5. Looking at the individual t-statistics, the coefficients for May, July, and October appear
   significant on their own (their absolute t-values exceed the critical t-value), while the
   coefficients for the remaining months are insignificant. The joint F-test, however, is the
   relevant test here, and based on it we cannot reject the null hypothesis that the returns
   are equal across the months.
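For reference, such a dummy-variable regression can be estimated with pandas and statsmodels. This is a minimal sketch; the `returns` series and the `month_dummy_regression` helper are illustrative assumptions, not part of the curriculum example.

```python
import pandas as pd
import statsmodels.api as sm

# Minimal sketch: regress monthly small-stock returns on 11 month dummies
# (December is the omitted month, so the intercept measures December returns).
# 'returns' is assumed to be a pandas Series of monthly returns with a
# DatetimeIndex covering Jan. 2001 to Dec. 2013.

def month_dummy_regression(returns: pd.Series):
    months = returns.index.month
    dummies = pd.get_dummies(months, prefix="m")       # m_1 ... m_12
    dummies = dummies.drop(columns=["m_12"])           # drop December
    dummies.index = returns.index
    X = sm.add_constant(dummies.astype(float))
    return sm.OLS(returns, X).fit()

# model = month_dummy_regression(returns)
# print(model.summary())   # F-statistic, p-value, and monthly coefficients
```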
In this model, the variance of the regression residuals (error terms) increases as the value of the
independent variable increases; that is, the error variance depends on the values of the
independent variable. This is conditional heteroskedasticity.
The Consequences of Heteroskedasticity
• The F-test for the overall significance of the regression becomes unreliable.
• The regression coefficient estimates are unaffected, but the standard errors are understated.
• Understated standard errors make the t-tests for the significance of individual regression
  coefficients unreliable, because the t-statistics are overstated. This can lead to finding
  significant relationships where none exist (Type I errors).
How to Test for Heteroskedasticity
The Breusch-Pagan test is widely used to detect heteroskedasticity. It tests the null
hypothesis that the regression’s squared error term is uncorrelated with the independent
variables (no conditional heteroskedasticity). The alternative hypothesis states that the
error term is correlated with the independent variables.
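A minimal sketch of the Breusch-Pagan test using statsmodels; the simulated data, whose error variance grows with x, are purely illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Minimal sketch: fit OLS on simulated data with error variance that depends
# on x, then run the Breusch-Pagan test.
# H0: squared residuals are unrelated to the independent variables.

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)     # error variance grows with x
results = sm.OLS(y, sm.add_constant(x)).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid,
                                                        results.model.exog)
print(f"Breusch-Pagan LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")
# A small p-value rejects H0 and signals conditional heteroskedasticity.
```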
How to Correct for Heteroskedasticity
Two methods are used to correct for the effects of heteroskedasticity (a code sketch of the first method follows the list below):
• Robust standard errors: Corrects the standard errors of the linear regression
model’s estimated coefficients to account for conditional heteroskedasticity.
• Generalized least squares: Modifies the original equation in an attempt to eliminate
heteroskedasticity.
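A minimal sketch of heteroskedasticity-robust standard errors with statsmodels on simulated data; the "HC1" estimator is one common choice of robust covariance.

```python
import numpy as np
import statsmodels.api as sm

# Minimal sketch: compare ordinary and heteroskedasticity-robust (White/HC)
# standard errors for the same simulated regression.

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=500)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)
X = sm.add_constant(x)

ols_fit    = sm.OLS(y, X).fit()                   # conventional standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")     # robust standard errors

print("OLS standard errors:   ", ols_fit.bse.round(4))
print("Robust standard errors:", robust_fit.bse.round(4))
# The coefficient estimates are identical; only the standard errors change.
```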
4.2 Serial Correlation
Serial correlation is also called autocorrelation. In serial correlation, regression errors in
one period are correlated with errors from previous periods. It is often found in time-series
regressions. The assumption that the regression errors are uncorrelated across observations
is violated here.
The Consequences of Serial Correlation
• As with heteroskedasticity, the regression coefficient estimates are unaffected, but the
  standard errors are understated.
• The t-statistics and F-statistic are too high as a result, which can lead to incorrectly
  rejecting the null hypothesis (a Type I error).
How to Test for Serial Correlation
A common method for testing serial correlation is the Durbin-Watson (DW) statistic. The
formula for DW statistic is:
DW ≈ 2(1 - r)
where:
DW = Durbin-Watson statistic
r = sample correlation between the regression residuals from one period and those from the
previous period
We will not go into the details of the formula, but just a few important points related to how
to interpret the statistic are listed below:
• The DW statistic takes values between 0 and 4 (0 ≤ DW ≤ 4).
• No serial correlation: DW ≈ 2
• Positive serial correlation: DW < 2
• Negative serial correlation: DW > 2
• When serial correlation = +1, DW = 0; when serial correlation = −1, DW = 4
It is not possible to determine a single critical value for the Durbin-Watson statistic, but we
can determine if it lies between two values, dl (lower value) and du (upper value), or outside
those values.
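A minimal sketch computing the Durbin-Watson statistic with statsmodels; the AR(1) error simulation is illustrative, and the induced positive serial correlation should push DW well below 2.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Minimal sketch: Durbin-Watson statistic for a regression whose errors follow
# an AR(1) process, i.e. errors in one period are correlated with prior errors.

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
errors = np.zeros(n)
for t in range(1, n):
    errors[t] = 0.7 * errors[t - 1] + rng.normal(scale=0.5)   # serially correlated
y = 1.0 + 2.0 * x + errors

results = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(results.resid)
print(f"Durbin-Watson statistic = {dw:.3f}")   # expect a value well below 2
```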
For example, the scatter plot below shows there is no correlation between the
variables X and Y in each cluster. For the combined data, however, there is a high
correlation between X and Y which is misleading. This is because the data is drawn
from two different samples that should not have been pooled.
In the bankruptcy example, the dependent variable is a discrete outcome: 0 or 1. It is highly
unlikely that a linear regression will produce such a discrete outcome; the predicted value could
be greater than 1 or less than 0. Since a probability cannot be greater than 100 percent or less
than 0 percent, a model designed for discrete outcomes, such as a probit or logit model, is the
appropriate choice.
Probit and logit models estimate the probability of a discrete outcome given the values of
the independent variables used to explain that outcome.
The probit model, which is based on the normal distribution, estimates the probability that Y
= 1 (a condition is fulfilled) given the value of the independent variable X.
The logit model is identical, except that it is based on the logistic distribution rather than a
normal distribution.
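A minimal sketch of logit and probit estimation with statsmodels on simulated bankruptcy-style data; the variable names and the single predictor are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

# Minimal sketch: logit and probit models for a binary (0/1) outcome,
# e.g. bankrupt vs. not bankrupt, on simulated data.

rng = np.random.default_rng(3)
n = 1000
leverage = rng.normal(size=n)                        # illustrative predictor
prob = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * leverage)))
bankrupt = rng.binomial(1, prob)                     # binary outcome

X = sm.add_constant(leverage)
logit_fit  = sm.Logit(bankrupt, X).fit(disp=0)       # logistic distribution
probit_fit = sm.Probit(bankrupt, X).fit(disp=0)      # standard normal distribution

# Both models return probabilities between 0 and 1, unlike linear regression.
print("Logit predicted probabilities: ", logit_fit.predict(X)[:3].round(3))
print("Probit predicted probabilities:", probit_fit.predict(X)[:3].round(3))
```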
Discriminant analysis yields a linear function that is used to create an overall score. This
score is then used to classify an observation into the bankrupt or non-bankrupt category.
7. Machine Learning
So far we have focused on multiple linear regression, where we work with relatively small
and well-organized data sets. However, if we are working with ‘Big Data’, these techniques may
not be sufficient. Big Data has the following characteristics:
• Volume: There is an immense volume of data compared to traditional data sets.
• Variety: The data may be structured or unstructured; we may have text- or video-based data.
• Velocity: We may have to analyze data on a real-time basis.
To analyze Big Data we have to use machine learning. An important concept under machine
learning is ‘data analytics’ where we extract information from Big Data, which gives us
actionable insights.
7.1 Major Focuses of Data Analytics
There are six major focuses of data analytics:
1. Measuring correlations: This focuses on understanding the contemporaneous
relationships between variables. Contemporaneous means that we are working with
variables at the same point in time. For example, we might be interested in measuring
the correlation between Microsoft returns and the returns of the S&P 500 index.
2. Making predictions: Here we try to determine if one variable (X) can be used to
predict another variable (Y). For example, we might measure the number of social
media posts related to Apple and try to predict the movement in Apple’s stock price
based on this information.
3. Making causal inferences: Here we try to determine if a change in an independent
variable (∆X) causes a change in a dependent variable (∆Y). A causal relationship
implies that there is some underlying mechanism that connects the dependent
variable with the independent variable. A causal relationship is a stronger
relationship as compared to correlations and predictions.
4. Classifying data: This focuses on sorting data into distinct categories. For example,
we may consider millions of credit card transactions and try to determine whether a
transaction is fraudulent or not. Here, our classification model is binary. It is also
possible to perform multi-category classification, for example, when assigning credit
ratings.
5. Sorting data into clusters: This involves grouping observations into clusters such that
   observations within the same cluster are similar to each other and different from
   observations in other clusters. The basis for creating these clusters may or may not
be specified in advance. For example, an equity analyst may divide a group of stocks
into clusters based on the risk-return characteristics of the stocks rather than
traditional classification systems such as sectors or industries.
6. Reducing the dimension of data: This involves reducing the number of independent
   variables (X) while retaining the predictive power of the model. For example, let’s
   say we initially have a model with 10 independent variables (X1, X2, … X10) that predict
   a dependent variable (Y). Assume that through a dimension-reduction exercise, we
   determine that just three independent variables (X1, X2, X3) predict Y just as well.
   Dimension reduction is important because it helps us identify the variables that have
   the strongest relationship with the dependent variable. Also, it has been shown that
   simpler models with fewer independent variables work better than complex models
   for out-of-sample forecasting. A code sketch of one dimension-reduction technique
   follows this list.
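The sketch referred to above uses principal component analysis (PCA), one common dimension-reduction technique, via scikit-learn; the simulated data set of ten correlated variables is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Minimal sketch: dimension reduction with PCA. Ten correlated independent
# variables are simulated from three underlying drivers, so a few components
# capture most of their variation.

rng = np.random.default_rng(4)
common = rng.normal(size=(500, 3))                          # 3 underlying drivers
loadings = rng.normal(size=(3, 10))
X = common @ loadings + 0.1 * rng.normal(size=(500, 10))    # 10 observed variables

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)                            # 500 x 3 instead of 500 x 10
print("Variance explained by 3 components:",
      pca.explained_variance_ratio_.sum().round(3))
```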
Historically, the six techniques mentioned above have been used with structured data. But in
recent times with the developments in machine learning, these techniques are also being
used with Big Data.
7.2 What Is Machine Learning?
Machine learning (ML) is a subset of artificial intelligence (AI). In ML, machines display
intelligent decision-making abilities through activities such as sensing, reasoning, and
understanding and speaking languages. For example, Apple’s digital assistant Siri can be
considered a machine learning program. In machine learning, machines are programmed to
improve performance in specified tasks with experience. A formal definition of machine
learning is provided below:
‘A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P if its performance at tasks in T, as measured by P, improves
with experience E.’
We can understand this definition through a simple example. Let’s say we want to apply
machine learning to determine whether a credit card transaction is fraudulent or not. Here
the terms can be defined as:
• T (task): Predict whether credit card transactions are fraudulent or not
• P (performance measure): Percentage of transactions correctly predicted
• E (experience): Number of credit card transactions that the program works with.
Let’s say that for the first 1 million transactions the performance measure was 80%, which
means that while predicting whether a credit card transaction was fraudulent or not, our
program was 80% accurate.
However, for the next 1 million transactions, when the experience level has reached a total of
2 million transactions, we observe that the performance has improved to 90%. This means
that the program has learnt through experience and improved its performance. This is an
example of machine learning.
7.3 Types of Machine Learning
Broadly speaking, we have two types of machine learning: supervised learning and
unsupervised learning.
Supervised learning is machine learning that makes use of labelled training data. A formal
definition is: ‘Supervised learning is the process of training an algorithm to take a set of
inputs X and find a model that best relates them to the output Y.’
In the credit card example, the ML program is given training data, which may consist of
several hundred transactions of different amounts, different origins etc. These transactions
are labelled as ‘fraudulent’ or ‘not fraudulent’. The ML program learns from this labelled
training data and can predict whether new transactions are fraudulent.
Supervised learning is useful for classifications and predictions.
Unsupervised learning is machine learning that does not make use of labeled training data.
Several input variables are used for analysis but no output (or target variable) is provided.
Because tagged data is not provided, the ML program has to discover structure within the
data on its own. An example of where unsupervised learning is applied is clustering. For
example, a program may be given financial information about companies and the program
can cluster companies into groups based on their attributes.
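A minimal sketch of such clustering using k-means (one common clustering algorithm) in scikit-learn; the simulated risk-return features are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal sketch: unsupervised clustering with k-means. Each row could represent
# a stock described by risk-return features (the data here are simulated); no
# labels are supplied, so the algorithm discovers the groups on its own.

rng = np.random.default_rng(5)
low_risk  = rng.normal(loc=[0.05, 0.10], scale=0.02, size=(50, 2))
high_risk = rng.normal(loc=[0.12, 0.30], scale=0.02, size=(50, 2))
features = np.vstack([low_risk, high_risk])            # columns: return, volatility

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print("Cluster sizes:", np.bincount(kmeans.labels_))   # two clusters of ~50 stocks
```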
Unsupervised learning is also used for dimension reduction where we want to reduce the
number of independent variables in a model.
Machine learning vocabulary: In multiple regression, we call the X variables the independent
variables and the Y variable the dependent variable. In machine learning, the Y variable is also
called the ‘tag’ variable or the ‘target’ variable, and the X variables are called ‘features’.
Supervised learning
Supervised learning can be divided into two categories: regression and classification. In
regression the target variable (Y) is continuous. In classification, the target variable (Y) is
either categorical (categories cannot be ranked) or ordinal (categories can be ranked, for
example, credit ratings).
Let us now look at the specific types of supervised learning algorithms.
Penalized Regression:
• It is a computationally efficient technique used in prediction problems.
• It is a special case of the generalized linear model (GLM), in which we can specify how the
  model is calibrated and how parsimonious it is (i.e., how few X variables it uses).
• Penalized regression addresses the overfitting problem through a process called
  regularization. Overfitting occurs when a model fits the training data too closely (for
  example, through excessive data mining) and therefore does not work well with
  out-of-sample data.
• The regression coefficients are chosen to minimize the sum of squared residuals plus a
  penalty term that increases with the number of included variables with non-zero
  coefficients. Because of this penalty, the model remains parsimonious, and only the
  variables most important for explaining Y remain in the model.
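A minimal sketch of penalized regression using LASSO, one widely used form, via scikit-learn; the simulated data, in which only three of ten features matter, are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Minimal sketch: LASSO, a common form of penalized regression. The penalty
# pushes the coefficients of unimportant variables to (or near) zero, keeping
# the model parsimonious. Only 3 of the 10 simulated features actually matter.

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)        # alpha controls the penalty strength

print("OLS coefficients:  ", ols.coef_.round(2))
print("LASSO coefficients:", lasso.coef_.round(2))   # irrelevant features near/at zero
```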
Classification and Regression Trees (CART)
• It is a computationally efficient technique that is adaptable to datasets with complex
structures.
• CART can be applied to predict either a categorical or a continuous target variable.
▪ If we are predicting a categorical target variable, then a classification tree is
produced.
▪ Whereas, if we are predicting a continuous outcome, then a regression tree is
produced.
• It is commonly applied to classification problems where the target is binary, for
  example, classifying credit card transactions as fraudulent or not fraudulent.
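A minimal sketch of a regression tree (continuous target) with scikit-learn; for a categorical target a classification tree (DecisionTreeClassifier) would be fitted instead. The data are simulated for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Minimal sketch: a regression tree (CART with a continuous target).

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.where(X[:, 0] > 0, 2.0, -1.0) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)   # shallow tree to limit overfitting
print("Predictions for two new observations:",
      tree.predict([[1.0, 0.0], [-1.0, 0.0]]).round(2))
```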
Random Forests
• A random forest classifier is a collection of classification trees.
• Instead of just one classification tree, several classification trees are built based on
random selection of features (variables).
• Random forests protect against overfitting on the training data.
• Random forests are an example of ensemble learning, whereby signal-to-noise ratio is
improved, because errors cancel each other out across the collection of classification
trees.
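A minimal sketch comparing a single classification tree with a random forest on the same simulated binary classification problem, illustrating the out-of-sample benefit of the ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Minimal sketch: a random forest (an ensemble of classification trees) versus
# a single tree on a simulated binary problem, e.g. fraudulent vs. not fraudulent.

rng = np.random.default_rng(8)
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("Single tree out-of-sample accuracy:  ", round(tree.score(X_test, y_test), 3))
print("Random forest out-of-sample accuracy:", round(forest.score(X_test, y_test), 3))
# The forest typically generalizes better because errors of individual trees offset.
```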
Neural Networks (Artificial Neural Networks)
• Neural networks are applied to tasks characterized by non-linearities and
interactions among variables.
• Neural networks have layers of nodes connected by links:
▪ Input-layer nodes correspond to the features used for prediction
▪ One or more hidden layers of nodes feed an output node
▪ The output node generates the predicted value
The following figure illustrates a neural network.
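In addition to the figure, a small feed-forward network can be sketched in code with scikit-learn's MLPClassifier; the choice of library and the simulated data are assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Minimal sketch: a small feed-forward neural network with one input layer
# (4 features), two hidden layers of 8 nodes each, and one output node.

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 4))
y = (np.sin(X[:, 0]) + X[:, 1] * X[:, 2] > 0).astype(int)   # non-linear relationship

net = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0)
net.fit(X, y)
print("In-sample accuracy:", round(net.score(X, y), 3))
```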
Summary
LO:a. Formulate a multiple regression equation to describe the relation between a
dependent variable and several independent variables and determine the statistical
significance of each independent variable.
A multiple regression allows us to determine the effect of more than one independent
variable on a particular dependent variable.
A multiple regression model is given by:
Yi = b0 + b1X1i + b2X2i + … + bkXki + εi,  i = 1, 2, …, n
LO:b. Interpret estimated regression coefficients and their p-values.
The slope coefficient bj measures how much the dependent variable Y changes when the
independent variable, Xj, changes by one unit holding all other independent variables
constant.
The lower the p-value for a test, the more significant the result.
LO:c. Formulate a null and an alternative hypothesis about the population value of a
regression coefficient, calculate the value of the test statistic, and determine whether
to reject the null hypothesis at a given level of significance.
H0: bj = 0
Ha: bj ≠ 0
The t-test, t = (b̂j − bj) / sb̂j, is used to determine the statistical significance of the population
value of a regression coefficient.
LO:d. Interpret the results of hypothesis tests of regression coefficients.
If the t-statistic is greater than the upper critical t-value (or less than the lower critical t-value),
then we can reject the null hypothesis and conclude that the regression coefficient is
statistically significant.
LO:e. Calculate and interpret 1) a confidence interval for the population value of a
regression coefficient and 2) a predicted value for the dependent variable, given an
estimated regression model and assumed values for the independent variables.
Confidence Interval = b̂j ± (critical t-value) × (SE of the coefficient)
Predicted value: Ŷi = b̂0 + b̂1X̂1i + b̂2X̂2i + … + b̂kX̂ki
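A minimal sketch of the confidence-interval calculation using scipy, with the coefficient estimate and standard error from the bid-ask spread example as illustrative inputs:

```python
from scipy import stats

# Minimal sketch: 95% confidence interval for a slope coefficient, using the
# estimate, its standard error, and the residual degrees of freedom (n - k - 1).

b_hat, std_err, df = -1.5186, 0.0808, 2584
t_crit = stats.t.ppf(0.975, df)                  # two-tailed 5% critical value
lower, upper = b_hat - t_crit * std_err, b_hat + t_crit * std_err
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```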