Math Supplement
1 Binary Classification

1.1 Diagram of Confusion Matrix with Basic Definitions
Outcome          Test Positive   Test Negative   Total
Condition +            e               f            a
Condition −            g               h            b
Total                  c               d            1
Where,
(a, b) is the Marginal Probability Distribution of the Condition, p(Y). Note that the rarer of the two is traditionally assigned to +, and the probability p(a) is called the incidence of a.
(c, d) is the Marginal Probability Distribution of the Classification, p(X).
(e, f, g, h) is the Joint Probability Distribution of the Condition and the Classification, p(X, Y).
Another way of representing the confusion matrix is:
Outcome          Test Positive          Test Negative          Total
Condition +      True Positive (TP)     False Negative (FN)      a
Condition −      False Positive (FP)    True Negative (TN)       b
Total                   c                      d                 1
1.2 Rates Derived from the Confusion Matrix
True Positive (TP) Rate: The conditional probability that someone with the condition has a positive test.
$$\frac{e}{a} = p(\text{Test Positive} \mid +)$$
False Negative (FN) Rate: The conditional probability that someone with the condition has a negative test.
$$\frac{f}{a} = p(\text{Test Negative} \mid +)$$
Note that TP rate + FN rate = 1.
False Positive (FP) Rate: The conditional probability that someone who does not have the condition has a positive test.
$$\frac{g}{b} = p(\text{Test Positive} \mid -)$$
True Negative (TN) Rate: The conditional probability that someone who does not have the condition has a negative test.
$$\frac{h}{b} = p(\text{Test Negative} \mid -)$$
Note that FP rate + TN rate = 1.
Positive Predictive Value (PPV): The conditional probability that someone who has a positive test has the condition.
$$\text{PPV} = \frac{e}{c} = p(+ \mid \text{Test Positive})$$
$$1 - \text{PPV} = \frac{g}{c}$$
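To make the cell arithmetic concrete, here is a minimal Python sketch using hypothetical joint probabilities for the cells; the names e, f, g, h, a, b, c follow the table above.

```python
# Hypothetical joint probabilities for the four cells (they sum to 1);
# names follow the confusion matrix table above.
e, f = 0.008, 0.002   # (condition +, test positive), (condition +, test negative)
g, h = 0.090, 0.900   # (condition -, test positive), (condition -, test negative)

a, b = e + f, g + h   # marginals of the condition; a is the incidence
c, d = e + g, f + h   # marginals of the classification

tp_rate = e / a       # p(Test Positive | +)
fn_rate = f / a       # p(Test Negative | +)
fp_rate = g / b       # p(Test Positive | -)
tn_rate = h / b       # p(Test Negative | -)
ppv     = e / c       # p(+ | Test Positive)

assert abs(tp_rate + fn_rate - 1) < 1e-12
assert abs(fp_rate + tn_rate - 1) < 1e-12
print(f"TP rate {tp_rate:.3f}, FP rate {fp_rate:.3f}, PPV {ppv:.3f}")
```

Note how a low incidence (here 1%) drives the PPV far below the TP rate, even for a fairly accurate test.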
1.3 Scoring Functions

Known data x about each unique item (radar image, medical test subject, potential borrower) is converted through some scoring function s(x) into a single numerical score.
2 Information Measures

2.1 Probability Review

2.1.1 Basic Probability Definitions
Sum rule
The marginal probability p(X) is equal to the sum of the joint probabilities p(X, Y) = p(X|Y)p(Y) over all possible values of Y:
$$p(X) = \sum_{Y} p(X, Y) = \sum_{Y} p(X \mid Y)\, p(Y)$$
Product rule
The joint probability p(X, Y) is equal to the product of the conditional probability p(X|Y) and the marginal probability p(Y).
In other words, p(A, B) = p(A|B)p(B).
Bayes Theorem
Given that p(A, B) = p(B, A), the product rule gives p(A|B)p(B) = p(B|A)p(A).
Dividing both sides of the above equation by p(B) gives:
$$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$$
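As a numeric illustration, here is a minimal sketch with hypothetical numbers chosen to echo the medical-testing setting of Section 1: a condition with 1% incidence and a test with p(Test Positive | +) = 0.80 and p(Test Positive | −) = 0.09.

```python
p_cond = 0.01                 # p(A): incidence of the condition
p_pos_given_cond = 0.80       # p(B | A)
p_pos_given_not = 0.09        # p(B | not A)

# Sum rule: marginal p(B) over both values of the condition
p_pos = p_pos_given_cond * p_cond + p_pos_given_not * (1 - p_cond)

# Bayes theorem: p(A | B) = p(B | A) p(A) / p(B)
p_cond_given_pos = p_pos_given_cond * p_cond / p_pos
print(f"p(+ | Test Positive) = {p_cond_given_pos:.4f}")   # ~0.0824
```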
2.2 Units

2.2.1 Change-of-Base Formula
$$\frac{1}{1.443} = 0.6931$$
or,
$$\frac{1 \text{ bit}}{1.443 \text{ bits/nat}} = 0.6931 \text{ nats}$$

2.2.4 Joint Entropy
The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as:
$$H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x, y)$$
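A minimal Python sketch of this definition, using a hypothetical 2x2 joint distribution:

```python
import numpy as np

# A hypothetical 2x2 joint distribution p(x, y); entries sum to 1.
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])

# H(X, Y) = -sum over x, y of p(x, y) log p(x, y); log base 2 gives bits.
h_xy = -np.sum(p_xy * np.log2(p_xy))
print(f"H(X, Y) = {h_xy:.4f} bits")   # ~1.8465 bits
```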
2.2.5 Conditional Entropy

The conditional entropy is the average, weighted by probability, of the entropies of the conditional distributions:
$$H(Y \mid X) = \sum_{i=1}^{n} p(x_i)\, H(Y \mid X = x_i)$$
$$H(X \mid Y) = \sum_{i=1}^{n} p(y_i)\, H(X \mid Y = y_i)$$
2.2.6 Relative Entropy

The relative entropy (Kullback-Leibler divergence) between two probability mass functions p(x) and q(x) is:
$$D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$
Given two discrete random variables X, Y with probability mass function p(x, y) and marginal probability mass functions p(x), p(y):
$$I(X; Y) = D(p(x, y) \,\|\, p(x)p(y))$$
In other words, the mutual information I(X; Y) equals the relative entropy between the joint distribution p(x, y) and the product distribution p(x)p(y).
This is consistent with the definition of independence: when p(x, y) = p(x)p(y), then
$$I(X; Y) = D(p(x, y) \,\|\, p(x)p(y)) = 0$$
2.2.10 Properties of Mutual Information

1. $I(X; Y) = H(X) - H(X \mid Y)$
2. $I(X; Y) = H(Y) - H(Y \mid X)$
3. $I(X; Y) = H(X) + H(Y) - H(X, Y)$
4. $I(X; Y) = D(p(X, Y) \,\|\, p(X)p(Y))$
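These four identities can be checked numerically. The sketch below reuses the hypothetical joint distribution from 2.2.4, computes the conditional entropies via the chain rule H(X|Y) = H(X, Y) − H(Y), and verifies that all four expressions agree:

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])
p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y = p_xy.sum(axis=0)                 # marginal p(y)

def H(p):
    """Discrete entropy in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_x, h_y, h_xy = H(p_x), H(p_y), H(p_xy.ravel())
h_x_given_y = h_xy - h_y               # chain rule: H(X|Y) = H(X,Y) - H(Y)
h_y_given_x = h_xy - h_x

# Relative entropy D(p(x,y) || p(x)p(y))
d_kl = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

# Identities 1-4 all give the same number
assert np.allclose([h_x - h_x_given_y,
                    h_y - h_y_given_x,
                    h_x + h_y - h_xy], d_kl)
print(f"I(X; Y) = {d_kl:.4f} bits")
```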
2.3 Linear Regression

2.3.1 Standardization of a Set of Numbers
Values $x_i \in X$ will be in units such as miles per hour, yards, pounds, dollars, etc. It is often convenient to standardize units by converting each value $x_i$ into standard units. Standard units are expressed in standard deviations from the mean of X.
A data point represented in standard units is also known as a Z-Score. To convert a data point into its corresponding Z-score, subtract the mean of the data set from the individual value, then divide by the standard deviation of the data set.
In other words, the Z-score of $x_i$ is calculated as follows:
$$z_i = \frac{x_i - \bar{x}}{\sigma_x}$$
Note that individual values larger than the mean will be positive, and values
less than the mean will be negative. The mean has a Z-Score of 0.
Note also that when a set of values $X = x_1, x_2, \ldots, x_n$ is expressed in Standard Units as Z-scores, so long as the mean $\bar{x}$ and standard deviation $\sigma_x$ are known, all information about the original values is preserved, and can be recovered at any time.
A data set can be converted into standard units by using the Excel function Standardize.
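The same conversion in Python, as a sketch with hypothetical data (numpy's mean and std play the roles of Excel's AVERAGE and STDEV.P):

```python
import numpy as np

x = np.array([55.0, 62.0, 70.0, 48.0, 65.0])   # hypothetical values, e.g. mph

z = (x - x.mean()) / x.std()                   # Z-scores (standard units)
print(z.round(3))

# The original values are fully recoverable from the Z-scores:
x_recovered = z * x.std() + x.mean()
assert np.allclose(x, x_recovered)
```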
2.3.2 Residuals and Root Mean Square Error

The residual equals the distance between the true value $y_i$ and a point on the regression line $\hat{y}_i = \alpha + \beta x_i$, which is the point estimate of $y_i$.
Root mean square error is calculated as follows for a set of n ordered pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:
1. Each residual is squared,
$$(y_i - \alpha - \beta x_i)^2$$
2. The squared errors are added together,
$$\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
3. The sum is divided by n to give the mean squared error,
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
4. The square root of the resulting mean is taken,
$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2}$$
This value is the root mean square (r. m. sq.) error of the model on a
particular set of n ordered pairs.
The regression line that minimizes root mean square error is known as the best
fit line.
Note that because taking the square root and dividing by n are both strictly increasing functions, it is sufficient to minimize the sum of squares:
$$Q = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
to determine the model parameters $\alpha$ and $\beta$ for the best fit line.
It can be demonstrated that when $\alpha$ and $\beta$ are chosen to minimize the root mean square residual, the mean residual $\frac{1}{n} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0$. Therefore, the root mean square residual is equal to the standard deviation of the residuals, $\sigma_e$.
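A short Python sketch of the four steps, with hypothetical data and model parameters:

```python
import numpy as np

# Hypothetical ordered pairs and model parameters.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
alpha, beta = 0.1, 2.0

residuals = y - (alpha + beta * x)      # step 1: y_i - (alpha + beta x_i)
q = np.sum(residuals ** 2)              # step 2: sum of squares Q
mse = q / len(x)                        # step 3: mean squared error
rms_error = np.sqrt(mse)                # step 4: root mean square error
print(f"Q = {q:.4f}, r.m.s. error = {rms_error:.4f}")
```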
2.3.3 The Normal Equations

Minimizing Q with respect to $\beta$ gives:
$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\, \bar{x}_n \bar{y}_n}{\sum_{i=1}^{n} x_i^2 - n\, \bar{x}_n^2}$$
Setting the derivative of $Q = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$ with respect to $\alpha$ equal to zero,
$$\frac{dQ}{d\alpha} = 0,$$
gives
$$n\alpha + \beta \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
Solving for $\alpha$,
$$\hat{\alpha} = \bar{y}_n - \hat{\beta}\, \bar{x}_n$$
These two equations are known as the normal equations for a regression line.
The notation $\hat{\alpha}$ and $\hat{\beta}$ with hats signifies that these values are calculations based on a particular finite set of n ordered pairs, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Adding more ordered pairs will change the values for $\hat{\alpha}$ and $\hat{\beta}$. When we can assume the existence of a stationary process relating two dependent random variables X and Y, then, at the limit, as the size of our sample of ordered pairs n gets very large, $\hat{\beta}$ will approach the true value $\beta$, and $\hat{\alpha}$ will approach the true value $\alpha$.
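A Python sketch of the normal equations on hypothetical data, cross-checked against numpy's built-in least-squares fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# beta-hat and alpha-hat from the normal equations above
beta_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / \
           (np.sum(x ** 2) - n * x.mean() ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Cross-check against numpy's least-squares polynomial fit
slope, intercept = np.polyfit(x, y, 1)
assert np.allclose([beta_hat, alpha_hat], [slope, intercept])

# The mean residual of the best-fit line is zero
assert abs(np.mean(y - (alpha_hat + beta_hat * x))) < 1e-10
print(f"beta-hat = {beta_hat:.4f}, alpha-hat = {alpha_hat:.4f}")
```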
2.3.4 Regression on Standardized Values

When the ordered pairs are expressed in standard units, both means are zero:
$$z_{\bar{x}} = \frac{\bar{x} - \bar{x}}{\sigma_x} = 0$$
so $\hat{\alpha} = 0$, and
$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \frac{\mathrm{Cov}_{XY}}{\mathrm{Var}_X} = R$$
This ratio is, by definition, the correlation coefficient R. The best fit regression line for standardized ordered pairs passes through the origin (0, 0) and its slope equals the correlation.
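This fact is easy to verify numerically (a sketch, reusing the hypothetical data above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

slope = np.sum(zx * zy) / np.sum(zx ** 2)   # beta-hat for standardized pairs
r = np.corrcoef(x, y)[0, 1]                 # correlation coefficient R
assert np.isclose(slope, r)
print(f"slope = R = {r:.4f}")
```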
2.3.5
For parametric models we are interested in the family of Gaussian (also called
normal) probability density functions of the form:
f (x; ; ) =
1
2 2
(x)2
2 2
Where and 2 are parameters, and 6= 0. These functions are all continuous
and differentiable.
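As a quick sanity check, the formula can be compared against scipy's implementation (hypothetical parameter values):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0    # hypothetical parameters
x = 0.7

f = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
assert np.isclose(f, norm.pdf(x, loc=mu, scale=sigma))
```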
2.3.6
x0 f (x)dx = 1
E(X ) =
xf (x)dx =
11
the mean.
The 2nd Moment
E(X 2 ) =
x2 f (x)dx = 2
the variance.
Note that is also the median and mode of a Gaussian. The square root of the
variance, , is known as the standard deviation.
2.3.7 The Cumulative Normal Function

The function F(x) = p represents the probability that a random variable falls in the interval $(-\infty, x]$, where the density is
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Note that the derivative of the cumulative normal function is the density: $F'(x) = f(x)$.
When adding two independent Gaussians, the resulting distribution is also a Gaussian, with mean equal to the sum of the means, and variance equal to the sum of the variances:
$$\mathcal{N}_1(m_1, v_1) + \mathcal{N}_2(m_2, v_2) = \mathcal{N}_3(m_3, v_3), \quad \text{where } m_3 = m_1 + m_2 \text{ and } v_3 = v_1 + v_2$$
The converse is also true: if the sum of two independent random variables is a Gaussian, both of those random variables must be Gaussians.
Linear transformations of Gaussian random variables are also Gaussians. If $X \sim \mathcal{N}(\mu, \sigma^2)$, then
$$aX + b \sim \mathcal{N}(a\mu + b,\; a^2\sigma^2)$$
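Both facts can be checked by simulation, as in this sketch with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sum of N(1, 4) and N(2, 9): means and variances add.
s = rng.normal(1, 2, n) + rng.normal(2, 3, n)
print(s.mean(), s.var())        # approx. 3 and 13

# Linear transformation aX + b of X ~ N(1, 4).
a, b = 2.0, 5.0
t = a * rng.normal(1, 2, n) + b
print(t.mean(), t.var())        # approx. a*mu + b = 7 and a^2 * var = 16
```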
2.3.9 The Standard Normal Distribution

The Gaussian with mean $\mu = 0$ and variance $\sigma^2 = 1$, $\mathcal{N}(0, 1)$, is known as the standard normal distribution. It has cumulative normal function:
$$F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\, dt$$
2.3.10 Parametric Models

Models that assume data are drawn from known distributions, with known mean and variance, are called parametric models. The most common parametric models assume that observed data are draws from random variables with a normal, or Gaussian, distribution.
One parametric model often used is to assume that the residuals, $y_i - \alpha - \beta x_i$, are normally distributed with mean = 0 and standard deviation = $\sigma_e$. Each value $y_i$ can then be represented as:
$$y_i = \alpha + \beta x_i + \varepsilon,$$
where $\varepsilon$ is an independent random draw from $Z = \mathcal{N}(0, \sigma_e^2)$, a normal distribution with mean = 0 and variance = $\sigma_e^2$.
If we further assume that x values are drawn from a random variable X with normal distribution, mean = $\bar{x}$ and standard deviation = $\sigma_x$, then Y, as the sum of two normal distributions, must also be a normal distribution.
Note that Z is independent of X. Therefore the variances add, so that:
$$\sigma_y^2 = \beta^2 \sigma_x^2 + \sigma_e^2$$
By the formula for adding means and variances of two normal distributions, Y is a normally-distributed random variable, with mean = $\beta\bar{x} + \alpha$ and variance = $\beta^2\sigma_x^2 + \sigma_e^2$.
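A simulation sketch verifying the variance formula, with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

alpha, beta = 1.0, 0.5
x_bar, sigma_x, sigma_e = 10.0, 2.0, 1.5

x = rng.normal(x_bar, sigma_x, n)      # X ~ N(x-bar, sigma_x^2)
eps = rng.normal(0.0, sigma_e, n)      # Z ~ N(0, sigma_e^2), independent of X
y = alpha + beta * x + eps

print(y.var(), beta**2 * sigma_x**2 + sigma_e**2)   # both approx. 3.25
print(y.mean(), beta * x_bar + alpha)               # both approx. 6.0
```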
2.3.11
It should be apparent that the larger the variance of the residual (relative to
the variance of Y), the smaller the absolute value of the correlation R.
When X and Z are normally-distributed, the exact relationship between
and R can be determined by substitution for , which is related to R by the
formula:
=R
y
x
Since,
y2 = 2 x2 + 2
y2 =
R2 y2 x2
+ 2
x2
= R2 y2 + 2
Rearranging terms,
2 = y2 R2 y2
13
R2 = 1
or
s
1
R=
2
y2
2
y2
p
1 2 .
2.3.12 Differential Entropy

The differential entropy of a continuous random variable X with density f(x) is defined as:
$$h(X) = -\int_S f(x) \ln f(x)\, dx$$
where the subscript S indicates that the domain is limited to the support set of the random variable, that part of the real line where $f(x) \neq 0$.
Differential entropy can be interpreted as a measure of the uncertainty about the value of a continuous random variable, or a measure of missing information. Random variables collapse to h(X) = 0 once they are associated with a known outcome or event x.
2.3.13 Differential Entropy of a Gaussian

For a Gaussian density f(x) with mean 0 and variance $\sigma^2$:
$$h(X) = -\int_S f(x) \ln f(x)\, dx$$
$$= -\int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}} \left[ -\frac{x^2}{2\sigma^2} - \ln \sqrt{2\pi\sigma^2} \right] dx$$
$$= \int f(x)\, \frac{x^2}{2\sigma^2}\, dx + \ln \sqrt{2\pi\sigma^2}$$
$$= \frac{1}{2\sigma^2} \int x^2 f(x)\, dx + \ln \sqrt{2\pi\sigma^2}$$
$$= \frac{E(X^2)}{2\sigma^2} + \frac{1}{2}\ln(2\pi) + \ln \sigma$$
$$= \frac{1}{2} + 0.9189 + \ln \sigma$$
$$= 1.4189 + \ln \sigma \text{ nats}$$
or, applying the bits/nats conversion from Section 2.2, $h(X) = 2.047 + \log_2 \sigma$ bits.
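The formula can be checked against scipy, which reports the differential entropy of a distribution in nats (hypothetical value of sigma):

```python
from math import log, log2
from scipy.stats import norm

sigma = 3.0
h_nats = 1.4189 + log(sigma)    # formula above, in nats
h_bits = 2.047 + log2(sigma)    # same quantity converted to bits

# scipy reports the differential entropy of a Gaussian in nats
print(h_nats, norm(scale=sigma).entropy())   # both approx. 2.5175
```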
2.3.14
= H(Y )H(Z)
= ln
By substitution from R2 = 1
2
y2
y
above,
1
= ln(1 R2 )
2
or, in terms of entropy to the base 2,
1
1
= log
2
2
1 Rxy
Note that, unlike discrete entropy, differential entropy in infinite (undefined)
when R2 = 1.
Note that the linear model, plus knowledge of X, leaves H(Y|X) = residual Z;
it is possible that a better-than-linear model exists. For this reason, when R is
known we know that the mutual information I(X;Y) is
15
1
1
log
2
2
1 Rxy
2.3.15 Converting Linear Regression Point Estimates to Probabilistic Forecasts (When Data are Parametric)
Each value $x_i$, when combined with the linear model $\hat{y}_i = \alpha + \beta x_i$ and the known root mean square residual $\sigma_e$, can be thought of as providing either:
1. a point forecast $\hat{y}_i$, or
2. a probabilistic forecast in the form of a Gaussian probability distribution with mean = $\hat{y}_i$ and standard deviation = $\sigma_e$.
Example
Assume $\hat{y}_i = 2$ and $\sigma_e$ of the linear model = 3. Assume the errors have a Gaussian distribution.
The true value of $y_i$ is unknown. Suppose you need to know the probability that the true value of $y_i > 5$.
This probability equals 1 − (the cumulative normal distribution from $-\infty$ to 5) of the Gaussian function with mean = 2 and standard deviation = 3. This is equal to the cumulative standard normal distribution from $-\infty$ to −1. In Excel, this is =NORM.S.DIST(-1, TRUE) and is equal to 15.87%.
Suppose you need to know the probability that the true value of $y_i < 0$. This probability is equal to the cumulative standard normal probability distribution from $-\infty$ to −2/3. In Excel this is =NORM.S.DIST(-0.66667, TRUE) and is equal to 25.25%.
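The same two probabilities in Python, using scipy's standard normal CDF in place of Excel's NORM.S.DIST:

```python
from scipy.stats import norm

y_hat, sigma_e = 2.0, 3.0    # point estimate and r.m.s. residual from the example

p_above_5 = 1 - norm.cdf((5 - y_hat) / sigma_e)   # = norm.cdf(-1)
p_below_0 = norm.cdf((0 - y_hat) / sigma_e)       # = norm.cdf(-2/3)

print(f"p(y > 5) = {p_above_5:.2%}")   # 15.87%
print(f"p(y < 0) = {p_below_0:.2%}")   # 25.25%
```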
A more precise conversion to probabilities can be adopted that assumes that the model's values for $\hat{\alpha}$ and $\hat{\beta}$ are themselves estimates, derived from a set of n known ordered pairs. This more advanced model is beyond the scope of the present discussion.
2.3.16 Adjusting the Root Mean Square Error for Small Samples

Note: Adjustment of the root mean square error of a point estimate when linear regression is calculated on a sample of small size.
We typically assume that our data set for linear correlation forecasting is parametric, meaning that we assume that our ordered pairs (x, y) of unstandardized or standardized data are actually drawn from underlying Gaussian probability distributions X and Y with constant true standard deviations and covariances, and consequently correlation and root mean square error.
We've observed that for any finite sample of ordered pairs drawn from the above random variables, the values for beta, the slope of a regression line, for alpha, the y-intercept, and for the linear correlation R, can and will all differ from the true values for the random variables.
Similarly, the observed root mean square error of the best-fit line will also differ from the true error as a function of the number of observations n. Interestingly, the true error is also changed by the z-score of each individual x value: the farther the x value is from the mean, the greater the true error. We can adjust the confidence interval for individual point estimates to take these variations into account. However, in general, the difference between the observed error and the true error is small if n is larger than 100 and z is between −3 and 3. The formula for the true error of a point estimate, adjusted for z-score and sample size n, is
$$\text{RMS} \cdot \sqrt{\frac{z^2 + 1 + n}{n}}$$
For values of n greater than 100 and z-scores between −1.5 and 1.5, the theoretical adjustment is always less than 2%, and can safely be ignored.
On samples with small n and large z-scores the adjustment is worthwhile. For example, assume the observed root mean square error is 0.74, a sample size of n = 50 ordered pairs, and a particular $x_i$ with z-score = 2.
The best estimate for the true error of the y point estimate is
$$0.74 \sqrt{\frac{4 + 1 + 50}{50}} = (0.74)(1.049) = 0.78$$
This increase of approximately 5% in the error is also accompanied by a change
of the distribution away from a pure Gaussian shape. However, when n is larger
than 100 a Gaussian is a very good approximation for the distribution of the
true error.
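Finally, the adjustment as a small Python helper, reproducing the worked example above:

```python
from math import sqrt

def adjusted_rms_error(rms, n, z):
    """True-error estimate for a point forecast, adjusted for sample size n
    and the z-score z of the x value (formula above)."""
    return rms * sqrt((z ** 2 + 1 + n) / n)

print(round(adjusted_rms_error(0.74, 50, 2), 2))   # 0.78, as in the example
```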