Unit 1 - Part 1
Unit 1
Course Instructor: Aparna Krishna
Winter 2023 – 2024
IIT (ISM) Dhanbad
About the Course
• Objectives:
• Provide knowledge of modern econometric techniques commonly
employed in the finance literature.
• Learning Outcomes:
• Understand the essential foundations of time series models.
• Construct and evaluate forecast models using financial time-series.
• Explain and apply models of volatility using financial time-series.
Course Outline
• Unit 1: Foundations of time series models – construct and forecast models
• More generally,
Y = β0 + β1x1 + β2x2 + β3x3 + u
Y: dependent variable
x1, x2, x3: independent variables
β0, β1, β2, β3: coefficients to be estimated
u: error term representing combined effect of omitted variables
Goals of Econometric Model
Wage = β0 + β1 education + β2 experience + β3 training + u
• Examples:
- A poll of usage of internet stock broking services
- Cross-section of stock returns on the New York Stock Exchange
- A sample of bond credit ratings for UK banks
• Continuous data: take on any value and are not confined to take specific
numbers
• For example, the rental yield on a property could be 6.2%, 6.24%, or
6.238%.
• Cardinal, ordinal and nominal variables may require different modelling approaches or at
least different treatments
Overview of Classical Linear Regression Model (CLRM)
Regression analysis is concerned with the study of the dependence of one variable,
the dependent variable, on one or more other variables, the explanatory variables,
with a view to estimating and/or predicting the (population) mean or average value
of the former in terms of the known values of the latter.
Example
• Question: Do people’s expenses increase as their income increases?
Data:
Income  Expense
80      55
80      60
80      65
80      70
80      75
100     65
100     70
100     74
100     80
100     85
100     88
• What is the dependent variable? The independent variable?
• What do the different income levels denote?
• Please compute:
a) Average value of expenses
b) Average value of expenses when income is 80 and 100 respectively
Conditional vs Unconditional Mean
• Average value of the dependent variable: unconditional expected value, or E(Y)
E(Y | Xi ) = β1 + β2Xi
Linear Regression Function or Linear regression equation
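The difference between the unconditional mean E(Y) and the conditional mean E(Y | X) can be checked on the income and expense data from the example. The sketch below is only an illustration in plain Python; the variable names are chosen here and are not part of the course material.

```python
from collections import defaultdict

# (income, expense) pairs from the example data
data = [
    (80, 55), (80, 60), (80, 65), (80, 70), (80, 75),
    (100, 65), (100, 70), (100, 74), (100, 80), (100, 85), (100, 88),
]

# Unconditional mean E(Y): average expense over all observations
unconditional_mean = sum(expense for _, expense in data) / len(data)

# Conditional mean E(Y | X): average expense within each income level
groups = defaultdict(list)
for income, expense in data:
    groups[income].append(expense)
conditional_means = {x: sum(ys) / len(ys) for x, ys in groups.items()}

print(round(unconditional_mean, 2))  # 71.55
print(conditional_means)             # {80: 65.0, 100: 77.0}
```

The conditional mean changes with the income level, while the unconditional mean is a single number for the whole sample.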
• ‘Linear’ in regression always means linear in parameters, and not necessarily in explanatory variables
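As an illustration of "linear in parameters", the sketch below fits Y = β1 + β2 X², which is nonlinear in X but linear in β1 and β2. The data are hypothetical and generated without an error term, and the helper `ols_fit` is a name introduced here for the textbook two-variable least-squares formulas.

```python
def ols_fit(x, y):
    """Two-variable OLS: return (intercept, slope) for y = b1 + b2 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data generated exactly from Y = 2 + 3 * X**2
x = [1, 2, 3, 4, 5]
y = [2 + 3 * xi ** 2 for xi in x]

# Regress Y on the transformed regressor Z = X**2:
# the model stays linear in the parameters b1 and b2
z = [xi ** 2 for xi in x]
b1, b2 = ols_fit(z, y)
print(b1, b2)
```

Because the relationship is exact in this toy data, the fitted coefficients recover the values used to generate it.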
Stochastic Specification of PRF (1)
Stochastic Specification of PRF (2)
• Stochastic: having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely
ui = Yi − E(Y | Xi)
Rearrange:
Yi = E(Y | Xi) + ui
[Figure: observed value yi, fitted value ŷi and residual ûi]
[Figure: two time-series panels, Male and Female, 1970–2010]
[Figure: plot of Maths against Reading]
Example 3: Income and Scores
[Figure: Reading and Maths panels]
• Reasons?
Ordinary Least Squares
Sample Regression Function
• 2 problems with minimising the simple sum of residuals:
- Each residual will get equal weight
- Residuals may cancel out each other
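The two problems listed above are the usual argument for minimising the sum of squared residuals instead of the plain sum of residuals. The sketch below uses hypothetical data and helper names introduced here: a flat line through the mean of y has residuals that sum to zero, yet it fits much worse than the least-squares line.

```python
def ols_fit(x, y):
    """Two-variable OLS: return (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def residuals(x, y, intercept, slope):
    """Residuals y - (intercept + slope * x) for a candidate line."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical data
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]

# Candidate 1: a flat line at the mean of y (ignores x entirely)
flat = residuals(x, y, sum(y) / len(y), 0.0)

# Candidate 2: the least-squares line
b1, b2 = ols_fit(x, y)
ols = residuals(x, y, b1, b2)

sum_flat, rss_flat = sum(flat), sum(e ** 2 for e in flat)
sum_ols, rss_ols = sum(ols), sum(e ** 2 for e in ols)

# Both sums of residuals are (numerically) zero because positive and
# negative residuals cancel; only the squared criterion separates the lines.
print(sum_flat, rss_flat)
print(sum_ols, rss_ols)
```

Squaring removes the cancellation and gives larger residuals more weight, which is why the squared criterion can tell the two lines apart.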
[Figure: Homoscedasticity vs Heteroscedasticity]
• Assumptions 3 and 4 together imply that the errors in the CLRM follow a normal distribution
• Assumption 5: No autocorrelation between the disturbances; Cov(ui, uj) = 0 for i ≠ j
• The errors associated with any two values of the independent variable are statistically independent of each other
• Assumption of no serial correlation, or no autocorrelation
• Helps with easier interpretation of Y in terms of X
• Assumption easy to justify in cross-section but difficult in time series
• Assumption 6: Number of observations (n) must be greater than
number of parameters to be estimated
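Assumption 5 can be checked informally by looking at the sample correlation between each residual and the one before it. The sketch below is a rough diagnostic, not a formal test; the function name and both residual series are hypothetical.

```python
def lag1_autocorrelation(e):
    """Sample lag-1 autocorrelation of a series of residuals."""
    n = len(e)
    e_bar = sum(e) / n
    d = [ei - e_bar for ei in e]
    numerator = sum(d[t] * d[t - 1] for t in range(1, n))
    denominator = sum(di ** 2 for di in d)
    return numerator / denominator

# Hypothetical residuals that drift slowly: neighbours look alike
smooth = [1.0, 0.8, 0.5, -0.2, -0.6, -0.9, -0.4, 0.3]

# Hypothetical residuals that flip sign every period
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

rho_smooth = lag1_autocorrelation(smooth)
rho_alternating = lag1_autocorrelation(alternating)
print(rho_smooth, rho_alternating)
```

A value far from zero in either direction suggests that neighbouring errors are related, which is the typical concern with time-series data.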
[Figure: Actual vs Predicted GallonsPer100Miles]
[Figure: Weight1000lb Line Fit Plot, showing GallonsPer100Miles and Predicted GallonsPer100Miles against Weight1000lb]
[Figure: e vs x]
[Figure: Normal probability plot]
• Can also indicate the minimum number of values needed for estimation.
• DoF of Mean / Average: n
• DoF of Variance: n-1
• DoF of Regression: n-2
Standard Errors of Least-Squares Estimates
• SE: Absolute measure of the typical distance that data points fall
from regression line
• Variance of the error term
Note:
• The variance of the estimator is inversely related to the variation in the independent variable. Hence the greater the variance in X, the greater the precision of the estimators
• The bigger the sample size, the more terms in the sum of squared deviations of X, and the more precise the estimator
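The standard errors discussed above can be computed with the usual textbook formulas for the two-variable model, assumed here because the formulas are not reproduced in the text: σ̂² = RSS / (n − 2), se(slope) = √(σ̂² / Σ(Xi − X̄)²) and se(intercept) = √(σ̂² ΣXi² / (n Σ(Xi − X̄)²)). The data and the function name are hypothetical.

```python
import math

def ols_with_standard_errors(x, y):
    """Two-variable OLS estimates with their standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # Residual sum of squares and error-variance estimate with n - 2 DoF
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)

    # A larger spread in X (larger s_xx) gives smaller standard errors
    se_slope = math.sqrt(sigma2_hat / s_xx)
    se_intercept = math.sqrt(sigma2_hat * sum(xi ** 2 for xi in x) / (n * s_xx))
    return {
        "intercept": intercept,
        "slope": slope,
        "sigma2_hat": sigma2_hat,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
    }

# Hypothetical data
result = ols_with_standard_errors([1, 2, 3, 4], [2, 4, 5, 9])
print(result)
```

The divisor n − 2 matches the degrees of freedom of a regression with two estimated parameters.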
Some Comments on the Standard Error Estimators
[Figure: panels of y against x]
Goodness of Fit: r²
• Goodness of fit checks how well the sample regression line fits the data
• Measures the proportion of total variation in Y explained by the regression model
[Figure: panels (a) r² = 0 to (f) r² = 1]
Goodness of Fit: r²
• TSS: Total Sum of Squares
• ESS: Explained Sum of Squares
• RSS: Residual Sum of Squares
• r² = ESS/TSS
• Properties of r²
• Nonnegative
• Takes a value between 0 and 1
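The decomposition behind the goodness-of-fit measure can be verified numerically. The sketch below uses hypothetical data and names introduced here; it computes TSS, ESS and RSS for a least-squares line with an intercept and checks that TSS = ESS + RSS, so that ESS/TSS lies between 0 and 1.

```python
def r_squared(x, y):
    """Return (tss, ess, rss, r2) for a two-variable OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]

    tss = sum((yi - y_bar) ** 2 for yi in y)                  # total variation
    ess = sum((fi - y_bar) ** 2 for fi in fitted)             # explained part
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # residual part
    return tss, ess, rss, ess / tss

# Hypothetical data
tss, ess, rss, r2 = r_squared([1, 2, 3, 4], [2, 4, 5, 9])
print(tss, ess, rss, round(r2, 4))
```

Since RSS cannot be negative, the ratio ESS/TSS cannot exceed 1, which is the second property listed above.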
SE vs r²
• SE: Absolute measure of the typical
distance that the data points fall
from the regression line. It is in the
units of the dependent variable
• Hypotheses are always made in pairs: the null hypothesis (H0) and the alternative hypothesis (H1)
β̂ ~ N(β, Var(β̂))
• If the errors are not normally distributed, will the parameter estimates still be normally distributed?
• Yes, if the other assumptions of CLRM hold, and sample size is sufficiently large
• From central limit theorem
Probability Distribution of the Least Squares Estimators (cont’d)
(α̂ − α) / √var(α) ~ N(0, 1) and (β̂ − β) / √var(β) ~ N(0, 1)
[Figure: rejection regions for a 2-sided test, a 1-sided test (upper tail) and a 1-sided test (lower tail)]
The Test of Significance Approach: Drawing Conclusions
5. Use the t-tables to obtain a critical value (values with which to compare the test
statistic)
6. If the test statistic lies in the rejection region then reject the null hypothesis (H0), else
do not reject H0.
Tests of Significance: An Example
2. Choose a significance level, α (again the convention is 5%). This is equivalent to choosing a (1 − α) × 100% confidence interval, i.e. 5% significance level = 95% confidence interval
3. Use the t-tables to find the appropriate critical value, which will again have T-2 degrees of freedom.
5. Perform the test: If the hypothesised value of β (β*) lies outside the confidence interval, then reject the null hypothesis that β = β*, otherwise do not reject the null.
Confidence Interval Approach: An Example
• Regression result: ŷt = 20.3 + 0.5091 xt
                          (14.38) (0.2561)
T = 22
• Note that the Test of Significance and Confidence Interval approaches always give the same answer.
• Under the test of significance approach, we would not reject H0 that β = β* if the test statistic lies within the non-rejection region, i.e. if
−tcrit ≤ (β̂ − β*) / SE(β̂) ≤ +tcrit
• Rearranging, we would not reject if
−tcrit × SE(β̂) ≤ β̂ − β* ≤ +tcrit × SE(β̂)
β̂ − tcrit × SE(β̂) ≤ β* ≤ β̂ + tcrit × SE(β̂)
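The equivalence of the two approaches can be checked with the slope estimate 0.5091, its standard error 0.2561 and T = 22 from the regression result above. The sketch assumes a 5% two-sided test, so the critical value is 2.086 from the t-tables with T − 2 = 20 degrees of freedom. The hypothesised values β* tried below are illustrative choices, not taken from the example.

```python
beta_hat = 0.5091   # slope estimate from the regression result
se_beta = 0.2561    # its standard error
t_crit = 2.086      # t-table value: 5% two-sided, 20 degrees of freedom

def reject_by_test(beta_star):
    """Test of significance: reject H0 if |t| exceeds the critical value."""
    t_stat = (beta_hat - beta_star) / se_beta
    return abs(t_stat) > t_crit

def reject_by_interval(beta_star):
    """Confidence interval: reject H0 if beta_star falls outside the interval."""
    lower = beta_hat - t_crit * se_beta
    upper = beta_hat + t_crit * se_beta
    return not (lower <= beta_star <= upper)

# Illustrative hypothesised values (assumptions for this sketch)
decisions = {}
for beta_star in (0.0, 1.0, 1.5):
    decisions[beta_star] = (reject_by_test(beta_star), reject_by_interval(beta_star))
    print(beta_star, decisions[beta_star])
```

For each hypothesised value the two functions return the same decision, as the rearranged inequality implies.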
• If we reject the null hypothesis at the 5% level, we say that the result of the test is statistically
significant.
• Note that a statistically significant result may be of no practical significance. E.g. if a shipment of
cans of beans is expected to weigh 450g per tin, but the actual mean weight of some tins is 449g, the
result may be highly statistically significant but presumably nobody would care about 1g of beans.
Example: Stata Regression Output
Research setting: Studies show that exercising can help prevent heart disease. Within reasonable limits, the
more you exercise, the less risk you have of suffering from heart disease. One way in which exercise reduces
your risk of suffering from heart disease is by reducing a fat in your blood, called cholesterol. The more you
exercise, the lower your cholesterol concentration. Furthermore, it has recently been shown that the amount of
time you spend watching TV – an indicator of a sedentary lifestyle – might be a good predictor of heart disease
(i.e., the more TV you watch, the greater your risk of heart disease).
Research thought process: Therefore, a researcher decided to determine if cholesterol concentration was related
to time spent watching TV in otherwise healthy 45 to 65 year old men (an at-risk category of people). For
example, as people spent more time watching TV, did their cholesterol concentration also increase (a positive
relationship); or did the opposite happen? The researcher also wanted to know the proportion of cholesterol
concentration that time spent watching TV could explain, as well as being able to predict cholesterol
concentration. The researcher could then determine whether, for example, people who spent eight hours
watching TV per day had dangerously high levels of cholesterol concentration compared to people watching just
two hours of TV.
Research set-up: To carry out the analysis, the researcher recruited 100 healthy male participants between the
ages of 45 and 65 years old. The amount of time spent watching TV (i.e., the independent variable, time_tv) and
cholesterol concentration (i.e., the dependent variable, cholesterol) were recorded for all 100 participants.
Expressed in variable terms, the researcher wanted to regress cholesterol on time_tv.
Stata Output
Third table shows results from parameter estimation
Components:
• Beta estimates
• Standard errors of the estimated parameters
• t-statistics of the estimated parameters (Coef. / Std. Err.)
• P>|t|: p-value (probability) associated with the t-statistic
• [95% Conf. Interval]: lower and upper bounds of the 95% confidence interval for the coefficient