Deep Learning Unit 2
Unit 2
ELECTIVE-VIII (BTCOE705 (B))
Feed-Forward Neural Networks
RMSPROP Optimizer for Gradient Descent
• RMSProp (Root Mean Square Propagation)
• An extension of the gradient descent optimization algorithm
• It keeps an exponentially decaying average of the squared gradients for each parameter (initialized to zero) and divides each parameter's step by the root of this average, so the step size adapts per parameter instead of shrinking toward zero.
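As a rough illustration, here is a minimal NumPy sketch of the RMSProp update rule; the quadratic objective, learning rate, and decay value below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rmsprop_update(w, grad, cache, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp step: keep a decaying average of squared gradients
    (cache, initialized to zero) and scale the step by its root."""
    cache = beta * cache + (1 - beta) * grad**2      # running mean of squared gradients
    w = w - lr * grad / (np.sqrt(cache) + eps)       # per-parameter adaptive step
    return w, cache

# Illustrative use: minimize f(w) = sum(w**2), whose gradient is 2*w
w = np.array([3.0, -2.0])
cache = np.zeros_like(w)            # the moving average starts at zero
for _ in range(100):
    grad = 2 * w
    w, cache = rmsprop_update(w, grad, cache)
print(w)                            # the values have moved toward the minimum at [0, 0]
```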
Classification based on Backpropagation
• Neural Networks
• If the actual output differs from the desired output, the error is propagated back to the hidden layer, the weights are updated, and the output is re-computed; this repeats until the desired output is produced.
The backpropagation algorithm takes the following inputs
• D: dataset
• T: target attributes
• w: weights
• b: bias
Backpropagation algorithm works in two
stages
• A forward pass that computes each layer's output, and a backward pass that propagates the error and updates the weights
• Illustrated on a two-layer network:
Compute the input to hidden unit j: Ij = (∑ wij * Oi) + bj
3. Now compute the output of the hidden layer, i.e., Oj:
   Oj = 1 / (1 + e^(−Ij))
4. In the same way we can compute the output of as many hidden layers as we want.
5. Now compute the input to the output layer k in the same way, i.e., Ik:
   Ik = (∑ wjk * Oj) + bk
Calculate the error factor for the output layer, i.e., layer k:
• Ek = Ok (1 − Ok)(Tk − Ok)
D. Similarly, calculate the error factor for the jth layer, i.e., the hidden layer:
• Ej = Oj (1 − Oj) ∑ wjk * Ek
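A minimal NumPy sketch of these equations on a small two-layer network follows; the layer sizes, learning rate, and the delta-rule weight/bias updates are assumptions added for illustration, since the slides stop at the error factors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.5):
    """One forward + backward pass of a two-layer network, following the slide's equations."""
    # Forward pass
    I_j = W1 @ x + b1                     # Ij = sum_i(wij * Oi) + bj
    O_j = sigmoid(I_j)                    # Oj = 1 / (1 + e^(-Ij))
    I_k = W2 @ O_j + b2                   # Ik = sum_j(wjk * Oj) + bk
    O_k = sigmoid(I_k)

    # Backward pass: error factors
    E_k = O_k * (1 - O_k) * (t - O_k)     # output-layer error factor
    E_j = O_j * (1 - O_j) * (W2.T @ E_k)  # hidden-layer error factor

    # Delta-rule weight/bias updates (an assumed, standard choice)
    W2 += lr * np.outer(E_k, O_j)
    b2 += lr * E_k
    W1 += lr * np.outer(E_j, x)
    b1 += lr * E_j
    return W1, b1, W2, b2, O_k

# Illustrative use with random initial weights
rng = np.random.default_rng(0)
x, t = np.array([1.0, 0.0]), np.array([1.0])
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
for _ in range(200):
    W1, b1, W2, b2, out = backprop_step(x, t, W1, b1, W2, b2)
print(out)  # moves toward the target 1.0
```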
Principal Component Analysis (PCA) Ideas
Principal Component Analysis
◼ See online tutorials such as
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
[Figure: scatter of data points in the (X1, X2) plane with the eigenvector directions Y1 and Y2 drawn through the cloud. Note: Y1 is the first eigenvector, Y2 is the second; Y2 is ignorable. Key observation: the variance along Y1 is the largest.]
Principal Component Analysis: one attribute first
• Question: how much spread is in the data along the axis?
• Sample variance:  s² = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1)
• Temperature data: 42, 40, 24, 30, 15, 30, 35, 30, 40, 30
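A quick check of this formula on the temperature column above, using NumPy (np.var with ddof=1 uses the same n − 1 denominator):

```python
import numpy as np

temps = np.array([42, 40, 24, 30, 15, 30, 35, 30, 40, 30], dtype=float)
# Sample variance: s^2 = sum((x_i - mean)^2) / (n - 1)
s2 = np.sum((temps - temps.mean())**2) / (len(temps) - 1)
print(s2, np.var(temps, ddof=1))   # the two values agree
```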
Now consider two dimensions
• X = Temperature, Y = Humidity
• Covariance measures the correlation between X and Y:
  • cov(X, Y) = 0: independent
  • cov(X, Y) > 0: move in the same direction
  • cov(X, Y) < 0: move in opposite directions
• cov(X, Y) = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)
• Data (X, Y): (40, 90), (40, 90), (40, 90), (30, 90), (15, 70), (15, 70), (15, 70), (30, 90), (15, 70), (30, 70), (30, 70), (30, 90), (40, 70)
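A short NumPy sketch that evaluates this covariance formula on the (X, Y) pairs above and compares it against np.cov:

```python
import numpy as np

X = np.array([40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40], dtype=float)  # Temperature
Y = np.array([90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70], dtype=float)  # Humidity
# cov(X, Y) = sum((X_i - mean_X)(Y_i - mean_Y)) / (n - 1)
cov_xy = np.sum((X - X.mean()) * (Y - Y.mean())) / (len(X) - 1)
print(cov_xy, np.cov(X, Y)[0, 1])  # the formula and np.cov agree; the positive sign shows X and Y move together
```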
More than two attributes: covariance
matrix
• Contains covariance values between all possible
dimensions (=attributes):
C (n×n) = (cᵢⱼ), where cᵢⱼ = cov(Dimᵢ, Dimⱼ)
• Example for three attributes (x, y, z): C is the symmetric 3×3 matrix whose entries are the pairwise covariances cov(x,x), cov(x,y), cov(x,z), cov(y,y), cov(y,z) and cov(z,z)
Eigenvalues & eigenvectors
• Ax = λx  ⇔  (A − λI)x = 0
• How to calculate x and λ:
  • Calculate det(A − λI); this yields a polynomial of degree n
  • Determine the roots of det(A − λI) = 0; the roots are the eigenvalues λ
  • Solve (A − λI)x = 0 for each λ to obtain the eigenvectors x
• Example:
  | 2  3 |   | 3 |   | 12 |       | 3 |
  | 2  1 | · | 2 | = |  8 | = 4 · | 2 |
  so λ = 4 is an eigenvalue of A, with eigenvector x = (3, 2)ᵀ
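A small NumPy sketch that reproduces this example; np.linalg.eig solves the same det(A − λI) = 0 problem numerically:

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
# Eigenvalues are the roots of det(A - lambda*I) = 0
lams, vecs = np.linalg.eig(A)
print(lams)                 # 4 and -1 for this matrix
x = vecs[:, 0]              # eigenvector paired with the first eigenvalue
print(A @ x, lams[0] * x)   # A x = lambda x: the two vectors match
```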
Principal components
• First principal component (PC1)
  • The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector; this is the direction of greatest variation
• Second principal component (PC2)
  • The direction with the maximum variation left in the data, orthogonal to PC1
• In general, only a few directions capture most of the variability in the data.
Steps of PCA
• Let X̄ be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: X' = X − X̄
• Compute the covariance matrix C of the adjusted data X'
• Find the eigenvectors and eigenvalues of C
  • For matrix C, an eigenvector is a (column) vector e having the same direction as Ce, i.e., Ce = λe; λ is called an eigenvalue of C
  • Ce = λe  ⇔  (C − λI)e = 0
  • Most data mining packages do this for you.
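A minimal NumPy sketch of these steps, applied here to the temperature/humidity data from the earlier slide; the helper name pca is an illustrative choice, not from the slides.

```python
import numpy as np

def pca(X):
    """PCA following the slide's steps: mean-adjust, covariance matrix,
    eigen-decomposition (rows of X = observations, columns = attributes)."""
    X_adj = X - X.mean(axis=0)                # adjust the data by the mean
    C = np.cov(X_adj, rowvar=False)           # covariance matrix of the adjusted data
    eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues/eigenvectors (C is symmetric)
    order = np.argsort(eigvals)[::-1]         # sort components by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order], X_adj

# Illustrative use on the temperature/humidity data
X = np.array([[40, 90], [40, 90], [40, 90], [30, 90], [15, 70], [15, 70],
              [15, 70], [30, 90], [15, 70], [30, 70], [30, 70], [30, 90],
              [40, 70]], dtype=float)
eigvals, eigvecs, X_adj = pca(X)
print(eigvals / eigvals.sum() * 100)          # % of total variance per component
```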
Eigenvalues
• Calculate the eigenvalues λ and eigenvectors x of the covariance matrix
• The eigenvalues λⱼ are used to calculate the percentage of total variance (Vⱼ) captured by each component j:
  Vⱼ = 100 · λⱼ / ∑ₓ₌₁ⁿ λₓ ,  where ∑ₓ₌₁ⁿ λₓ = n
Principal components - Variance
[Figure: bar chart of the variance (%) captured by each principal component, PC1 through PC10.]
Transformed Data
• The eigenvalue λⱼ corresponds to the variance on component j
• Thus, sort the components by λⱼ
• Take the first p eigenvectors eᵢ, where p is the number of top eigenvalues
• These are the directions with the largest variances
  | yᵢ₁ |   | e₁  |   | xᵢ₁ − x̄₁ |
  | yᵢ₂ | = | e₂  | · | xᵢ₂ − x̄₂ |
  | ... |   | ... |   |   ...    |
  | yᵢp |   | eₚ  |   | xᵢn − x̄ₙ |
  (each eⱼ is an eigenvector written as a row of length n)
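A short, self-contained NumPy sketch of this projection for p = 1, again using the temperature/humidity data as an assumed input:

```python
import numpy as np

X = np.array([[40, 90], [40, 90], [40, 90], [30, 90], [15, 70], [15, 70],
              [15, 70], [30, 90], [15, 70], [30, 70], [30, 70], [30, 90],
              [40, 70]], dtype=float)
X_adj = X - X.mean(axis=0)                       # mean-adjusted data
eigvals, eigvecs = np.linalg.eigh(np.cov(X_adj, rowvar=False))
order = np.argsort(eigvals)[::-1]                # sort by decreasing eigenvalue
eigvecs = eigvecs[:, order]

p = 1                                            # keep only the top component
Y = X_adj @ eigvecs[:, :p]                       # y_i = e^T (x_i - mean): projected coordinates
print(Y.ravel())                                 # each observation reduced to p numbers
```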
An Example
• Mean1 = 24.1, Mean2 = 53.8
• Some of the data rows (X1, X2) and their mean-adjusted values (X1 − Mean1, X2 − Mean2):
  30   87    5.9    33.25
  30   23    5.9   −30.75
  15   32   −9.1   −21.75
[Figure: scatter plot of the mean-adjusted data.]
Covariance Matrix
      |  75  106 |
• C = | 106  482 |
If we only keep one dimension: e2
• Project each mean-adjusted point onto e2:
  yᵢ = (0.21  −0.98) (xᵢ₁; xᵢ₂) = 0.21·xᵢ₁ − 0.98·xᵢ₂
• For example, one of the points maps to yᵢ = −10.14
[Figure: the projected values yᵢ plotted along the e2 direction.]
Applications – Gene expression analysis
• Reference: Raychaudhuri et al. (2000)
• Purpose: Determine core set of conditions for useful
gene comparison
• Dimensions: conditions, observations: genes
• Yeast sporulation dataset (7 conditions, 6118 genes)
• Result: Two components capture most of variability (90%)
• Issues: uneven data intervals, data dependencies
• PCA is common prior to clustering
• Crisp clustering questioned: genes may correlate with multiple clusters
• Alternative: determination of gene’s closest neighbours
Singular Value Decomposition
Underconstrained Least Squares
• More subtle version: more data points than unknowns, but data
poorly constrains function
• Example: fitting to y = ax² + bx + c
Underconstrained Least Squares
• Singular value decomposition (SVD) writes A as
          | w₁  0   ⋯  0  |
  A = U · | 0   ⋱      ⋮  | · Vᵀ
          | 0   ⋯   0  wₙ |
  i.e., A = U · W · Vᵀ with W = diag(w₁, …, wₙ)
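A short NumPy sketch: np.linalg.svd produces the U, w, Vᵀ factors above, and the SVD-based pseudoinverse handles the poorly constrained quadratic fit from the previous slide; the sample matrix and data points below are illustrative assumptions.

```python
import numpy as np

# SVD: A = U * diag(w) * V^T
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
U, w, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(w) @ Vt))    # True: the factorization reconstructs A

# Underconstrained fit of y = a*x^2 + b*x + c: more points than unknowns,
# but the x-values cluster in a narrow range, so the curvature a is poorly constrained.
x = np.array([0.99, 1.0, 1.01, 1.02])
y = 2 * x + 1                                 # data generated from a straight line
M = np.column_stack([x**2, x, np.ones_like(x)])
coeffs = np.linalg.pinv(M) @ y                # pinv is built from the SVD of M
print(coeffs)                                 # roughly a = 0, b = 2, c = 1
```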