
Deep Learning

Unit 2
ELECTIVE-VIII (BTCOE705 (B))
Feed-Forward Neural Networks
RMSPROP Optimizer for Gradient Descent
• RMSPROP (Root Mean Square Propagation)
• An extension of the gradient descent optimization algorithm
• Maintains a decaying average of squared gradients, so the per-parameter step size adapts and does not shrink to zero
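As a rough illustration, here is a minimal NumPy sketch of a single RMSProp update step (the function name, default learning rate, decay factor, and epsilon below are illustrative assumptions, not values given in the slides):

import numpy as np

def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    # Keep an exponentially decaying average of squared gradients
    cache = decay * cache + (1 - decay) * grad ** 2
    # Scale each parameter's step by the root mean square of its recent gradients
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Usage: the cache starts at zero and is carried between steps
w = np.array([0.5, -0.3])
cache = np.zeros_like(w)
grad = np.array([0.1, -0.2])          # gradient from the current batch (illustrative)
w, cache = rmsprop_step(w, grad, cache)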
Classification based on Backpropagation
• Neural Networks

• Disadvantages of Neural Networks
• Long training times
• Low interpretability for humans

• Advantages of Neural Networks
• Highly tolerant of noisy data (unknown data, data with unmatched patterns, impure data, etc.)
• Can generalize to unseen patterns (patterns not present in the training data can still be classified)

• Backpropagation is an iterative training algorithm

[Diagram: I/P Layer → Hidden Layer → O/P Layer]

• Here, if the actual output is not the desired one, the error is propagated back to the hidden layer, the weights are updated, and the output is recomputed; this repeats until the desired output is produced.
Backpropagation algorithm contains
• D: Dataset
• T: Target attributes
• w: weights
• b: bias
Backpropagation algorithm works in two stages
• Two-layer network: I/p → Hidden → O/p
• Feed-forward network: I/p → Hidden → Hidden → O/p
• There can be cross connections between layers
• Fully Connected Network
Backpropagation Algorithm
• First is the input layer
• There should be some output from the input layer
• Let I represent the input layer and Ii represent an input unit of this layer
• Let Oi represent the output of each unit of the input layer
• Oi = Ii (input-layer units simply pass their inputs through)
Steps in the Backpropagation algorithm
A. Initialize all the weights and biases
B. Repeat until the terminating condition is reached:
1. First compute the outputs of all the units of the input layer.
2. Then compute the inputs of the second layer, i.e. the hidden layer, Ij:
   Ij = (∑ wij * Oi) + bj
3. Now compute the output of the hidden layer, Oj:
   Oj = 1 / (1 + e^(-Ij))
4. In this way we can compute outputs for as many hidden layers as we want.
5. In the same way, compute the input of the output layer k, Ik:
   Ik = (∑ wjk * Oj) + bk
6. Now compute the output of the output layer, Ok:
   Ok = 1 / (1 + e^(-Ik))
Now the first cycle of output is computed. We then go for the weight update based on the error, as follows.
C. Calculate the error for the kth layer, i.e. the output layer:
   Ek = Ok (1 - Ok)(Tk - Ok)
D. Similarly, calculate the error factor for the jth layer, i.e. the hidden layer:
   Ej = Oj (1 - Oj) ∑ wjk * Ek
E. Then we go for the weight update.
F. At last we go for the bias update. (A sketch of one full cycle is given below.)
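Putting steps 1-6 and C-F together, the following is a minimal NumPy sketch of one training cycle for a single hidden layer with sigmoid units. The learning rate l and the update rules (w += l * E * O, b += l * E) are the standard ones used with this style of backpropagation; they are assumptions here, since the update formulas themselves are not shown in these slides.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_cycle(x, t, w_ij, b_j, w_jk, b_k, l=0.1):
    # Forward pass (steps 1-6)
    o_i = x                                   # O_i = I_i for the input layer
    o_j = sigmoid(w_ij.T @ o_i + b_j)         # I_j = sum(w_ij * O_i) + b_j,  O_j = 1/(1+e^-I_j)
    o_k = sigmoid(w_jk.T @ o_j + b_k)         # I_k = sum(w_jk * O_j) + b_k,  O_k = 1/(1+e^-I_k)

    # Error terms (steps C and D)
    e_k = o_k * (1 - o_k) * (t - o_k)         # E_k = O_k (1 - O_k)(T_k - O_k)
    e_j = o_j * (1 - o_j) * (w_jk @ e_k)      # E_j = O_j (1 - O_j) * sum(w_jk * E_k)

    # Weight and bias updates (steps E and F), assumed standard rules
    w_jk += l * np.outer(o_j, e_k)
    w_ij += l * np.outer(o_i, e_j)
    b_k  += l * e_k
    b_j  += l * e_j
    return o_k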
Principal Components Analysis (PCA)
• An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D
• Can be used to:
• Reduce number of dimensions in data
• Find patterns in high-dimensional data
• Visualize data of high dimensionality
• Example applications:
• Face recognition
• Image compression
• Gene expression analysis

Principal Components Analysis: Ideas (PCA)

• Does the data set ‘span’ the whole of the d-dimensional space?
• For a matrix of m samples x n genes, create a new covariance matrix of size n x n.
• Transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
• The PCs are constructed to capture as much of the variation in the data as possible.

Principal Component Analysis
• See online tutorials such as
  http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

[Scatter plot of data in the X1-X2 plane with the eigenvector directions Y1 and Y2 drawn through it.
Note: Y1 is the first eigenvector, Y2 is the second; Y2 is ignorable.
Key observation: the variance along Y1 is the largest.]
Principal Component Analysis: one attribute first
• Example attribute: Temperature = {42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30}
• Question: how much spread is in the data along the axis (distance to the mean)?
• Variance = (standard deviation)^2

  s^2 = ∑_{i=1}^{n} (Xi - X̄)^2 / (n - 1)
Now consider two dimensions
Covariance measures the correlation between X and Y:
• cov(X,Y) = 0: X and Y are uncorrelated (vary independently)
• cov(X,Y) > 0: X and Y move in the same direction
• cov(X,Y) < 0: X and Y move in opposite directions

  cov(X, Y) = ∑_{i=1}^{n} (Xi - X̄)(Yi - Ȳ) / (n - 1)

  X=Temperature   Y=Humidity
       40             90
       40             90
       40             90
       30             90
       15             70
       15             70
       15             70
       30             90
       15             70
       30             70
       30             70
       30             90
       40             70
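As a quick numerical check (a small sketch, not part of the slides), the covariance of the two columns above can be computed directly from the formula and compared with NumPy's np.cov, which uses the same (n - 1) denominator:

import numpy as np

X = np.array([40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40])   # Temperature
Y = np.array([90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70])   # Humidity

cov_xy = np.sum((X - X.mean()) * (Y - Y.mean())) / (len(X) - 1)       # formula from the slide
print(cov_xy, np.cov(X, Y)[0, 1])                                     # both print the same value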
More than two attributes: covariance matrix
• Contains covariance values between all possible pairs of dimensions (= attributes):

  C_{n x n} = ( cij | cij = cov(Dimi, Dimj) )

• Example for three attributes (x, y, z):

      [ cov(x,x)  cov(x,y)  cov(x,z) ]
  C = [ cov(y,x)  cov(y,y)  cov(y,z) ]
      [ cov(z,x)  cov(z,y)  cov(z,z) ]
Eigenvalues & eigenvectors
• Vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix).
• In the equation Ax = λx, λ is called an eigenvalue of A.

  [ 2  3 ] [ 3 ]   [ 12 ]       [ 3 ]
  [ 2  1 ] [ 2 ] = [  8 ]  = 4  [ 2 ]
Eigenvalues & eigenvectors

• Ax = λx  ⇔  (A - λI)x = 0
• How to calculate x and λ:
• Calculate det(A - λI); this yields a polynomial of degree n
• Determine the roots of det(A - λI) = 0; the roots are the eigenvalues λ
• Solve (A - λI)x = 0 for each λ to obtain the eigenvectors x
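The example from the previous slide can be checked with NumPy (a small sketch; note that np.linalg.eig returns unit-length eigenvectors and does not guarantee their order):

import numpy as np

A = np.array([[2.0, 3.0], [2.0, 1.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                   # contains 4 and -1 (the roots of det(A - λI) = 0)
i = int(np.argmax(eigenvalues))      # pick the eigenvalue 4
print(eigenvectors[:, i])            # proportional to (3, 2), scaled to unit length
print(A @ np.array([3.0, 2.0]))      # [12. 8.] = 4 * [3, 2], as in the slide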
Principal components
• First principal component (PC1)
• The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector, the direction of greatest variation
• Second principal component (PC2)
• The direction with the maximum variation left in the data, orthogonal to PC1
• In general, only a few directions capture most of the variability in the data.
Steps of PCA
• Let X̄ be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: X' = X - X̄
• Compute the covariance matrix C of the adjusted X'
• Find the eigenvectors and eigenvalues of C:
• For matrix C, the eigenvectors are the vectors e (column vectors) having the same direction as Ce, i.e. Ce = λe, where λ is called an eigenvalue of C
• Ce = λe  ⇔  (C - λI)e = 0
• Most data mining packages do this for you (see the sketch below).
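A minimal NumPy sketch of these steps (the function and variable names are illustrative; np.linalg.eigh is used because a covariance matrix is symmetric):

import numpy as np

def pca(X, p):
    # X: (samples x attributes) data matrix, p: number of components to keep
    x_mean = X.mean(axis=0)                  # mean vector (mean of all rows)
    X_adj = X - x_mean                       # adjust the original data by the mean
    C = np.cov(X_adj, rowvar=False)          # covariance matrix of the adjusted data
    eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues and eigenvectors of C
    order = np.argsort(eigvals)[::-1]        # sort by eigenvalue, largest first
    E_p = eigvecs[:, order[:p]]              # keep the top p eigenvectors (as columns)
    return X_adj @ E_p                       # project the adjusted data onto them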
Eigenvalues
• Calculate the eigenvalues λ and eigenvectors x of the covariance matrix.
• The eigenvalues λj are used to calculate the percentage of total variance (Vj) captured by each component j:

  Vj = 100 * λj / ∑_{x=1}^{n} λx
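A one-line check of this formula, using the two eigenvalues that appear in the worked example later in these slides (51.8 and 560.2):

import numpy as np

eigenvalues = np.array([560.2, 51.8])
V = 100 * eigenvalues / eigenvalues.sum()
print(V)                                  # roughly [91.5  8.5] percent of total variance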
Principal components - Variance

[Bar chart: percentage of total variance (%) captured by each principal component, PC1 through PC10]
Transformed Data
• Each eigenvalue λj corresponds to the variance along component j
• Thus, sort the eigenvectors by λj
• Take the first p eigenvectors ei, where p is the number of top eigenvalues
• These are the directions with the largest variances

  [ yi1 ]   [ e1  ]   [ xi1 - x̄1 ]
  [ yi2 ] = [ e2  ] * [ xi2 - x̄2 ]
  [ ... ]   [ ... ]   [   ...    ]
  [ yip ]   [ ep  ]   [ xin - x̄n ]
An Example
Mean1 = 24.1, Mean2 = 53.8 (X1' = X1 - Mean1, X2' = X2 - Mean2)

  X1   X2    X1'     X2'
  19   63   -5.1    9.25
  39   74   14.9   20.25
  30   87    5.9   33.25
  30   23    5.9  -30.75
  15   35   -9.1  -18.75
  15   43   -9.1  -10.75
  15   32   -9.1  -21.75

[Scatter plots of the original data (X1 vs X2) and of the mean-adjusted data (X1' vs X2')]
Covariance Matrix
      [  75  106 ]
• C = [ 106  482 ]

• Using MATLAB, we find out:
• Eigenvectors:
• e1 = (-0.98, -0.21), λ1 = 51.8
• e2 = (0.21, -0.98), λ2 = 560.2
• Thus the second eigenvector is more important!
If we only keep one dimension: e2
• We keep the dimension of e2 = (0.21, -0.98)
• We can obtain the final data as:

  yi = (0.21  -0.98) [ xi1 ]  = 0.21 * xi1 - 0.98 * xi2
                     [ xi2 ]

• The resulting one-dimensional values yi: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63

[Plot: the projected values yi placed along the e2 axis]
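These values can be reproduced from the mean-adjusted columns (X1', X2') of the example table, using only the rows visible above (a small sketch, not part of the slides):

import numpy as np

X_adj = np.array([[-5.1,   9.25],
                  [14.9,  20.25],
                  [ 5.9,  33.25],
                  [ 5.9, -30.75],
                  [-9.1, -18.75],
                  [-9.1, -10.75],
                  [-9.1, -21.75]])
e2 = np.array([0.21, -0.98])
y = X_adj @ e2                             # yi = 0.21*xi1' - 0.98*xi2'
print(y.round(3))                          # [-10.136 -16.716 -31.346  31.374  16.464   8.624  19.404]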
Applications – Gene expression analysis
• Reference: Raychaudhuri et al. (2000)
• Purpose: Determine core set of conditions for useful
gene comparison
• Dimensions: conditions, observations: genes
• Yeast sporulation dataset (7 conditions, 6118 genes)
• Result: Two components capture most of variability (90%)
• Issues: uneven data intervals, data dependencies
• PCA is commonly applied prior to clustering
• Crisp clustering is questioned: genes may correlate with multiple clusters
• Alternative: determination of each gene's closest neighbours

Singular Value Decomposition
Underconstrained Least Squares

• What if you have fewer data points than parameters in your function?
– Intuitively, can't do standard least squares
– Recall that the solution takes the form A^T A x = A^T b
– When A has more columns than rows, A^T A is singular: can't take its inverse, etc.
Underconstrained Least Squares

• More subtle version: more data points than unknowns, but the data poorly constrain the function
• Example: fitting to y = ax^2 + bx + c
Underconstrained Least Squares

• Problem: if the problem is very close to singular, roundoff error can have a huge effect
– Even on "well-determined" values!
• Can detect this:
– Uncertainty proportional to covariance C = (A^T A)^-1
– In other words, unstable if A^T A has small values
– More precisely, care if x^T (A^T A) x is small for some x
• Idea: if part of the solution is unstable, set that part of the answer to 0
– Avoid corrupting good parts of the answer
Singular Value Decomposition (SVD)

• Handy mathematical technique that has application to many problems
• Given any m x n matrix A, an algorithm to find matrices U, V, and W such that
  A = U W V^T
  U is m x n and orthonormal
  W is n x n and diagonal
  V is n x n and orthonormal
SVD

  A = U W V^T,   where W = diag(w1, ..., wn)

• Treat as a black box: code is widely available
  In MATLAB: [U,W,V] = svd(A,0)
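The NumPy equivalent of the MATLAB call above is np.linalg.svd with full_matrices=False (a minimal sketch; note that NumPy returns the singular values as a vector and returns V already transposed):

import numpy as np

A = np.random.rand(6, 3)                              # any m x n matrix (illustrative)
U, w, Vt = np.linalg.svd(A, full_matrices=False)      # "thin" SVD, like svd(A, 0)
W = np.diag(w)                                        # rebuild the diagonal matrix W
print(np.allclose(A, U @ W @ Vt))                     # True: A = U W V^T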
SVD

• The wi are called the singular values of A


• If A is singular, some of the wi will be 0
• In general rank(A) = number of nonzero wi
• SVD is mostly unique (up to permutation of singular values, or if
some wi are equal)
SVD and Inverses

• Why is SVD so useful?


• Application #1: inverses
• A^-1 = (V^T)^-1 W^-1 U^-1 = V W^-1 U^T
– Using the fact that inverse = transpose for orthogonal matrices
– Since W is diagonal, W^-1 is also diagonal, with the reciprocals of the entries of W
SVD and Inverses

• A^-1 = (V^T)^-1 W^-1 U^-1 = V W^-1 U^T

• This fails when some wi are 0
– It's supposed to fail – singular matrix
• Pseudoinverse: if wi = 0, set 1/wi to 0 (!)
– "Closest" matrix to inverse
– Defined for all matrices (even non-square, singular, etc.)
– Equal to (A^T A)^-1 A^T if A^T A is invertible
SVD and Least Squares

• Solving Ax = b by least squares
• x = pseudoinverse(A) times b
• Compute the pseudoinverse using SVD
– Lets you see if the data are singular
– Even if not singular, the ratio of the max to min singular values (the condition number) tells you how stable the solution will be
– Set 1/wi to 0 if wi is small (even if not exactly 0)
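A minimal sketch of this recipe (the tolerance tol is an illustrative assumption for deciding when wi counts as "small"):

import numpy as np

def svd_least_squares(A, b, tol=1e-10):
    U, w, Vt = np.linalg.svd(A, full_matrices=False)
    cond = w.max() / w.min() if w.min() > 0 else np.inf   # condition number = max/min singular value
    w_inv = np.where(w > tol * w.max(), 1.0 / w, 0.0)     # set 1/wi to 0 if wi is small
    x = Vt.T @ (w_inv * (U.T @ b))                        # x = V W^-1 U^T b (pseudoinverse times b)
    return x, cond

# Usage on random data (illustrative)
A = np.random.rand(10, 3)
b = np.random.rand(10)
x, cond = svd_least_squares(A, b)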
SVD and Eigenvectors

• Let A = U W V^T, and let xi be the ith column of V
• Consider A^T A xi:

  A^T A xi = V W U^T U W V^T xi = V W^2 V^T xi = V W^2 ei = wi^2 xi

  (here ei is the ith standard basis column [0 ... 1 ... 0]^T, since V^T xi = ei)

• So the elements of W are the square roots of the eigenvalues of A^T A, and the columns of V are its eigenvectors
– What we wanted for robust least-squares fitting!
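A quick numerical check of this relationship (a small sketch on a random matrix):

import numpy as np

A = np.random.rand(8, 4)
U, w, Vt = np.linalg.svd(A, full_matrices=False)
eigvals = np.linalg.eigvalsh(A.T @ A)                 # eigenvalues of A^T A (ascending order)
print(np.allclose(np.sort(w ** 2), eigvals))          # True: squared singular values = eigenvalues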
Summary of Singular Value Decomposition
