
Propagation-based

explanations
CSEP 590B: Explainable AI
Ian Covert & Su-In Lee
University of Washington



Previously
§ Feature importance explanations
§ Removal-based explanations
§ Shapley values

§ Today: propagation-based explanations



Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Setup
§ Consider a classification model $f(x)$
§ $f_y(x)$ is the probability for class $y$
§ Input is $x \in \mathbb{R}^d$ (must be continuous)
§ Assume $f$ is differentiable (a neural network)

[Diagram: inputs $x_1, x_2, \dots, x_d$ feed into the model $f(\cdot)$, which outputs the prediction $\hat{y}$]


Review: backpropagation
§ Deep learning models are trained using
stochastic gradient descent (SGD)
§ Get a minibatch of examples
§ Calculate predictions and loss
§ Calculate gradients using “backprop” algorithm



Backpropagation

[Diagram: network $x \to h_1 \to h_2 \to \hat{y}$, from the input layer to the output layer, with network weights connecting each layer]


Backpropagation

[Diagram: network $x \to h_1 \to h_2 \to \hat{y}$]

Gradient descent:
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

Network parameters:
$$\theta = \{w_1, b_1, w_2, b_2, w_{\text{out}}, b_{\text{out}}\}$$

Forward pass:
$$h_1 = \sigma(w_1 x + b_1), \quad h_2 = \sigma(w_2 h_1 + b_2), \quad \hat{y} = w_{\text{out}} h_2 + b_{\text{out}}$$
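To make this concrete, here is a minimal PyTorch sketch of the forward pass and one gradient descent step. The layer sizes, the sigmoid choice for $\sigma$, and the cross-entropy loss are illustrative assumptions, not part of the slides.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the two-hidden-layer network above and one SGD step.
# Layer sizes, the sigmoid nonlinearity, and the loss are assumptions.
torch.manual_seed(0)
d, m, k = 10, 32, 3                      # input dim, hidden width, classes

w1 = torch.randn(m, d, requires_grad=True)
b1 = torch.zeros(m, requires_grad=True)
w2 = torch.randn(m, m, requires_grad=True)
b2 = torch.zeros(m, requires_grad=True)
w_out = torch.randn(k, m, requires_grad=True)
b_out = torch.zeros(k, requires_grad=True)
theta = [w1, b1, w2, b2, w_out, b_out]   # network parameters

def forward(x):
    h1 = torch.sigmoid(x @ w1.T + b1)    # h1 = sigma(w1 x + b1)
    h2 = torch.sigmoid(h1 @ w2.T + b2)   # h2 = sigma(w2 h1 + b2)
    return h2 @ w_out.T + b_out          # y_hat = w_out h2 + b_out

# One gradient descent step on a random minibatch.
x, y = torch.randn(8, d), torch.randint(k, (8,))
loss = F.cross_entropy(forward(x), y)    # L(theta)
loss.backward()                          # backprop: grad of loss w.r.t. theta
eta = 0.1
with torch.no_grad():
    for p in theta:
        p -= eta * p.grad                # theta_{t+1} = theta_t - eta * grad
        p.grad = None
```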
Backpropagation (cont.)
§ Model loss is mean prediction error:
$$\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; \theta), y_i\big)$$

§ Gradient calculation:
$$\nabla_\theta \mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell\big(f(x_i; \theta), y_i\big)$$


Chain rule
§ Calculate gradients for all parameters 𝜃 using
chain rule
§ Get gradients for last hidden layer
§ Then for the previous layer
§ Then the layer before that…

§ Backpropagation = propagating gradients


backward through the network

Rumelhart et al., “Learning representations by back-propagating errors” (1986)



Chain rule (cont.)
[Diagram: network $x \to h_1 \to h_2 \to \hat{y}$ with $\mathcal{L}(\theta) = \ell(\hat{y}, y)$; the gradients $\partial\mathcal{L}/\partial h_2$ and $\partial\mathcal{L}/\partial h_1$ propagate backward, yielding $\partial\mathcal{L}/\partial w_{\text{out}}$, $\partial\mathcal{L}/\partial w_2$, $\partial\mathcal{L}/\partial w_1$ and $\partial\mathcal{L}/\partial b_{\text{out}}$, $\partial\mathcal{L}/\partial b_2$, $\partial\mathcal{L}/\partial b_1$]
Propagation-based
explanations

§ Use the backprop idea to quantify feature importance

§ Rather than gradients w.r.t. parameters, calculate gradients w.r.t. inputs


Input gradients
[Diagram: same network as before, with $\mathcal{L}(\theta) = \ell(\hat{y}, y)$, but the chain rule is now extended one step further to compute $\partial\mathcal{L}/\partial x$ in addition to the parameter gradients]
Input gradients
[Diagram: take the gradient of the prediction $f_y(x; \theta) = \hat{y}$ (instead of the loss) and compute it w.r.t. the model input: $\partial f_y / \partial x$]
Intuition
§ Partial derivatives represent sensitivity to small perturbations

§ Mathematically:
$$\frac{\partial f_y}{\partial x_i}(x) = \lim_{\epsilon \to 0} \frac{f_y(x + e_i \cdot \epsilon) - f_y(x)}{\epsilon}$$

§ The numerator is the delta from a small change in the $i$th direction; the limit takes that change to be very small, and it is measured relative to the size of the change
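A quick way to sanity-check this definition is to compare an autograd gradient against the finite-difference quotient. The quadratic `f` below is a hypothetical stand-in for a network output $f_y$:

```python
import torch

# Sanity check: compare the autograd input gradient to the finite-difference
# quotient (f(x + eps * e_i) - f(x)) / eps for one feature.
def f(x):
    return (x ** 2).sum()

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
f(x).backward()                     # gradient w.r.t. the input: 2x

i, eps = 1, 1e-4
e_i = torch.zeros(3)
e_i[i] = 1.0
with torch.no_grad():
    fd = (f(x + eps * e_i) - f(x)) / eps
print(x.grad[i].item(), fd.item())  # both approximately -4.0
```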


Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Application to XAI
§ Idea: find features that cause large output
changes when perturbed

§ Remark: this quantifies feature sensitivity, but is not necessarily related to feature removal


Vanilla gradients
§ For an input $x$ and label $y$, calculate the gradient of the prediction $f_y(x)$:
$$a_i = \frac{\partial f_y}{\partial x_i}(x)$$

§ Can optionally use the absolute value:
$$a_i = \left| \frac{\partial f_y}{\partial x_i}(x) \right|$$

Simonyan et al., “Deep inside convolutional networks: Visualizing image classification models and saliency maps” (2014)
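A minimal sketch in PyTorch, assuming `model` is any differentiable module whose output's last dimension indexes class scores; the helper name and interface are illustrative:

```python
import torch

# Minimal sketch of vanilla gradients for a differentiable classifier.
def vanilla_gradients(model, x, y, absolute=False):
    x = x.clone().detach().requires_grad_(True)
    f_y = model(x)[..., y].sum()    # prediction for class y (summed over batch)
    f_y.backward()                  # gradient w.r.t. the input, not parameters
    return x.grad.abs() if absolute else x.grad
```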


Vanilla gradients (cont.)

[Figure: example saliency maps]


Variation 1: SmoothGrad
§ Average gradients across inputs near 𝑥
§ E.g., add Gaussian noise:
$$a_i = \mathbb{E}\left[ \frac{\partial f_y}{\partial x_i}(x + \epsilon) \right] \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2)$$

§ In practice, use a small number of $\epsilon$ samples (e.g., 50)

§ Must tune $\sigma^2$ to an appropriate level

Smilkov et al., “SmoothGrad: Removing noise by adding noise” (2017)
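A sketch of SmoothGrad reusing the `vanilla_gradients` helper above; the 50-sample default follows the slide's suggestion, and `sigma` must still be tuned:

```python
import torch

# Sketch of SmoothGrad: average vanilla gradients over noisy copies of x.
def smooth_grad(model, x, y, sigma=0.15, n_samples=50):
    total = torch.zeros_like(x)
    for _ in range(n_samples):
        noise = sigma * torch.randn_like(x)        # epsilon ~ N(0, sigma^2)
        total += vanilla_gradients(model, x + noise, y)
    return total / n_samples
```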


SmoothGrad (cont.)

[Figure: standard saliency maps shown alongside varying levels of input noise]


Variation 2: Grad x Input
§ Multiply the gradient by the input values:
$$a_i = x_i \cdot \frac{\partial f_y}{\partial x_i}(x)$$

Shrikumar et al., “Not just a black box: Learning important features through propagating activation differences” (2016)
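Reusing the hypothetical `vanilla_gradients` helper above, Grad x Input is a one-liner:

```python
# Grad x Input: elementwise product of the input with its gradient.
def grad_x_input(model, x, y):
    return x * vanilla_gradients(model, x, y)
```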


Grad x Input (cont.)
§ Interpretation: consider the model’s first-order Taylor expansion around $x_0$:
$$f_y(x) \approx f_y(x_0) + (x - x_0)^\top \frac{\partial f_y}{\partial x}(x_0)$$

§ The gradient gives a linearized version of the model (like replacing a function with its tangent line)
§ Grad x Input approximates the impact of setting the input to zero: taking $x_0 = x$ and evaluating at $0$ gives $f_y(x) - f_y(0) \approx \sum_i x_i \frac{\partial f_y}{\partial x_i}(x)$
§ Similar to occlusion (see previous lecture)


Variation 3: Integrated gradients
§ Gradients can become “saturated”
§ Model is sensitive to big input changes, but not small ones

[Plot: in the saturated region, $f(x)$ is insensitive to small changes in $x$, and the gradient ≈ 0]

§ Saturation can yield small gradients, even for important inputs

Sundararajan et al., “Axiomatic attribution for deep networks” (2017)


IntGrad (cont.)
§ Idea: address the saturation issue by calculating gradients for rescaled images, $\alpha \cdot x$:
$$\frac{\partial f_y}{\partial x_i}(\alpha \cdot x) \quad \text{for } 0 \le \alpha \le 1$$

§ Integrate (average) gradients across the range of rescaled images:
$$\int_{\alpha=0}^{1} \frac{\partial f_y}{\partial x_i}(\alpha \cdot x) \, d\alpha$$

§ Multiply by the input feature value:
$$a_i = x_i \int_{\alpha=0}^{1} \frac{\partial f_y}{\partial x_i}(\alpha \cdot x) \, d\alpha$$


IntGrad (cont.)
§ Implicitly relies on an all-zeros baseline
§ Can instead use a non-zero baseline $x'$:
$$a_i = (x_i - x_i') \int_{\alpha=0}^{1} \frac{\partial f_y}{\partial x_i}\big(x' + \alpha \cdot (x - x')\big) \, d\alpha$$

§ Related to a different idea from cooperative game theory: the Aumann-Shapley value
§ Different from the previous Shapley value
§ Has its own axiomatic derivation (see Sundararajan et al.)


IntGrad (cont.)
§ Problem: the integral is hard to calculate

§ Solution: use a Riemann sum approximation with $m$ regularly spaced values $\alpha_k \in [0, 1]$:
$$a_i \approx (x_i - x_i') \frac{1}{m} \sum_{k=1}^{m} \frac{\partial f_y}{\partial x_i}\big(x' + \alpha_k \cdot (x - x')\big)$$
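A sketch of this approximation, again reusing the `vanilla_gradients` helper; the zeros baseline matches the earlier slide, and $m = 50$ steps is an assumption:

```python
import torch

# Sketch of the Riemann-sum approximation to integrated gradients.
def integrated_gradients(model, x, y, baseline=None, m=50):
    if baseline is None:
        baseline = torch.zeros_like(x)              # x' = 0
    total = torch.zeros_like(x)
    for k in range(1, m + 1):
        alpha = k / m                               # regularly spaced in [0, 1]
        point = baseline + alpha * (x - baseline)   # x' + alpha * (x - x')
        total += vanilla_gradients(model, point, y)
    return (x - baseline) * total / m               # (x - x') times avg gradient
```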


IntGrad (cont.)

[Figure: gradients computed with the image rescaled by $\alpha$]




GradCAM
§ In CNNs, hidden layers represent high-level
visual concepts
§ Hidden layers retain spatial information due to
convolutional structure

§ Idea: explain models via the last convolutional layer instead of the input layer

Selvaraju et al., “Grad-CAM: Visual explanations from deep networks via gradient-based localization” (2017)


CNN receptive fields

§ The receptive field grows at each layer, but remains localized


GradCAM procedure
§ Denote the final layer’s hidden representation as $A \in \mathbb{R}^{w \times h \times c}$
§ Width $w$, height $h$, channels $c$
§ Each channel $k = 1, \dots, c$ is denoted $A^k \in \mathbb{R}^{w \times h}$

§ The final prediction $f_y(x)$ can be viewed as a function of the representation $A$
§ E.g., $A$ → global average pooling → MLP


GradCAM procedure (cont.)

[Figure: the prediction $\hat{y} = f_y(x)$ viewed as a function of the final convolutional representation]


GradCAM procedure (cont.)
§ Calculate gradients w.r.t. $A$:
$$\frac{\partial f_y}{\partial A^k_{ij}} \quad \text{for all } i, j, k$$

§ Average the gradients within each channel:
$$\alpha_k^y = \frac{1}{wh} \sum_{i,j} \frac{\partial f_y}{\partial A^k_{ij}}$$

§ Aggregate the hidden representations across the channel dimension using $\alpha_k^y$:
$$a_{ij} = \sum_{k=1}^{c} \alpha_k^y A^k_{ij}$$


GradCAM procedure (cont.)
§ Often use a thresholding function (suppress negative attributions):
$$a_{ij} = \text{ReLU}\left( \sum_{k=1}^{c} \alpha_k^y A^k_{ij} \right)$$

§ Can optionally upsample the low-resolution scores $a_{ij}$ to the original input size (e.g., bilinear upsampling)
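A sketch of the full GradCAM computation, assuming the final-layer activations `A` and the gradients $\partial f_y / \partial A$ have already been captured (e.g., via forward/backward hooks, which is architecture-dependent and not shown):

```python
import torch
import torch.nn.functional as F

# Sketch of GradCAM given final convolutional activations A (shape [c, h, w],
# channels-first as in PyTorch) and the gradients of f_y w.r.t. A.
def grad_cam(A, grads, output_size):
    alpha = grads.mean(dim=(1, 2))               # average over spatial locations
    cam = (alpha[:, None, None] * A).sum(dim=0)  # weighted sum over channels
    cam = F.relu(cam)                            # suppress negative attributions
    cam = cam[None, None]                        # add batch/channel dims
    return F.interpolate(cam, size=output_size, mode="bilinear",
                         align_corners=False)[0, 0]
```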


GradCAM interpretation
§ The values $\alpha_k^y$ represent the smoothed or averaged gradient of class $y$ w.r.t. channel $k$

§ At each location, the activations $A^k_{ij}$ are multiplied by the averaged gradients and then aggregated

§ Similar to Grad x Input, but using a hidden layer instead of the input layer
§ Like a Taylor approximation of setting internal activations to zero


GradCAM results

[Figure: example GradCAM visualizations]


Other gradient-based methods
§ Guided backprop: Springenberg et al., “Striving for simplicity: The all-convolutional net” (2014)
§ VarGrad: Adebayo et al., “Local explanation methods for deep neural networks lack sensitivity to parameter values” (2018)
§ Expected gradients: Erion et al., “Learning explainable models using attribution priors” (2020)
§ BlurIG: Xu et al., “Attribution in scale and space” (2020)


Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ 10 min break
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Propagation-based
explanations
(continued)
CSEP 590B: Explainable AI
Ian Covert & Su-In Lee
University of Washington



Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Modified backpropagation
§ Previous approaches rely on gradient backprop
§ Others use heuristic backprop variants
§ Calculate “importance” of internal nodes, propagate
back to earlier ones
§ Requires justification for different backprop
heuristics



Layer-wise relevance
propagation (LRP)

§ Intuition: iteratively calculate relevance scores for every layer of the model
§ Start with nodes in the last hidden layer
§ Move backwards through the model by splitting scores in the previous layer

Bach et al., “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation” (2015)


LRP (cont.)
§ Let $R_j^{(l+1)} \in \mathbb{R}$ denote the relevance for the $j$th node in layer $l+1$
§ Initialize $R^{(L)} = f_y(x)$ for the output node of interest
§ Let $R_{i \leftarrow j}^{(l,\, l+1)}$ denote the relevance message sent from the $j$th node in layer $l+1$ to the $i$th node in layer $l$
§ Want the messages to satisfy two conservation properties:

$$R_j^{(l+1)} = \sum_i R_{i \leftarrow j}^{(l,\, l+1)} \quad \text{(summation of outgoing importance)}$$

$$R_i^{(l)} = \sum_j R_{i \leftarrow j}^{(l,\, l+1)} \quad \text{(summation of incoming importance)}$$


LRP (cont.)
§ The previous rules don’t define a unique procedure, so the authors propose multiple options
§ For example, the “$\epsilon$-rule”:
$$R_{i \leftarrow j}^{(l,\, l+1)} = \frac{z_{ij}}{z_j + \epsilon \cdot \text{sign}(z_j)} R_j^{(l+1)}$$

where $z_{ij} = w_{ij}^{(l+1)} h_i^{(l)}$, $z_j = \sum_i z_{ij} + b_j^{(l+1)}$, and $\epsilon > 0$

§ Finally, attributions are given by $a_i = R_i^{(1)}$ for $i = 1, \dots, d$
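A sketch of the $\epsilon$-rule for a single dense layer, following the formulas above; the tensor shapes and function name are assumptions:

```python
import torch

# Sketch of the epsilon-rule for one dense layer: redistribute relevance R
# from layer l+1 back to layer l. Assumed shapes: w is [out, in], h is the
# layer-l activation [in], b is [out], R is [out]; eps > 0 stabilizes small z_j.
def lrp_epsilon(h, w, b, R, eps=1e-6):
    z = w * h                             # z_ij = w_ij * h_i      -> [out, in]
    z_j = z.sum(dim=1) + b                # z_j = sum_i z_ij + b_j -> [out]
    denom = z_j + eps * torch.sign(z_j)
    messages = z * (R / denom)[:, None]   # R_{i<-j} = (z_ij / denom_j) * R_j
    return messages.sum(dim=0)            # R_i = sum_j R_{i<-j}
```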




LRP discussion
§ Arguably less intuitive than other methods; requires some heuristic choices (which “rule” to use?)
§ Can be difficult to adapt to different architectures
§ E.g., does not automatically support residual connections (ResNet architecture), and requires extension to transformers


Other modified backpropagation methods
§ DeepLIFT: Shrikumar et al., “Not just a black box: Learning important features through propagating activation differences” (2016)
§ PatternAttribution: Kindermans et al., “Learning how to explain neural networks: PatternNet and PatternAttribution” (2017)
§ A unifying perspective: Ancona et al., “Towards better understanding of gradient-based attribution methods” (2017)
§ Excitation backprop: Zhang et al., “Top-down neural attention by excitation backprop” (2018)
§ LRP variants: Montavon et al., “Layer-wise relevance propagation: An overview” (2019)


Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Many explanation methods
§ Removal-based explanations
§ SHAP, LIME, RISE, Occlusion, permutation tests

§ Propagation-based explanations
§ SmoothGrad, IntGrad, GradCAM

What should we use in practice?



Model flexibility

What kind of model are you explaining?

§ Removal-based explanations are model-agnostic
§ Can work with any model class (DNNs, trees, etc.)

§ Propagation-based explanations are mainly for neural networks
§ Usually require differentiation
§ Some even have architecture constraints


Data flexibility
What kind of data do you have?

§ Removal-based explanations can handle discrete and continuous features
§ E.g., replace inputs with alternative values from the dataset

§ Propagation-based methods only make sense for continuous features
§ Derivative = sensitivity to a small change
§ Small changes are meaningless for discrete features
§ E.g., $x_i \in \{0, 1, 2\}$


Local or global
What kind of explanation do you need?

§ Both removal- and propagation-based methods can produce local explanations

§ Removal-based methods are better suited for global explanations
§ Can focus on global model behavior (e.g., dataset loss)
§ To use a propagation-based method, we require an aggregation scheme (e.g., mean of local explanations)


Speed
Is speed important?

§ Propagation-based methods are fast
§ Backward pass through the DNN
§ Weak dependence on the number of features

§ Removal-based methods can be slow
§ Often require making predictions with many feature subsets
§ Shapley values are particularly challenging


Quality
Which explanation is most informative or correct?

§ Theory can serve as a guide
§ E.g., Shapley value axioms, IntGrad axioms
§ We can also take an empirical approach
§ Metrics for explanation quality (next lecture)

§ Perspective: no explanation is wrong, but some procedures are misaligned with user questions


Popular methods
Which methods do most people use?

§ A small number of methods dominate
§ Depends on the data domain (tabular, image, NLP)


Tabular data
§ Permutation tests are widely used for global
feature importance

§ SHAP is ubiquitous for local explanations
§ TreeSHAP is built into XGBoost, LGBM
§ KernelSHAP used for other models


Computer vision
§ Gradient-based methods are currently most
popular: GradCAM, IntGrad
§ Removal-based methods are usually too slow
§ Some papers try to fix this, but not popular (yet?)
§ Masking model: Dabkowski & Gal, “Real time image saliency for black box
classifiers” (2017)
§ CXPlain: Schwab & Karlen, “CXPlain: Causal explanations for model
interpretation under uncertainty” (2019)
§ FastSHAP: Jethani et al., “FastSHAP: Real-time Shapley value estimation”
(2021)



NLP
§ NLP models (LSTMs, transformers) can use
most methods
§ Gradient-based methods are popular
§ Removal-based explanations are slower, but leave-
one-out (occlusion) is sometimes used

§ For transformers, some use attention as the explanation
§ Perhaps an interpretable architecture?
§ We’ll return to this in a later lecture


Popular packages

GitHub package                           Description                                               Stars
slundberg/shap                           SHAP variations (KernelSHAP, TreeSHAP, DeepSHAP, etc.)    15.8k
marcotcr/lime                            LIME for images, tabular data                             9.7k
utkuozbulak/pytorch-cnn-visualizations   Various gradient-based methods                            6.4k
jacobgil/pytorch-grad-cam                GradCAM + GradCAM variations                              4.5k
pytorch/captum                           Various gradient-based methods + SHAP                     3.0k
sicara/tf-explain                        Various gradient-based methods + occlusion                906
kundajelab/deeplift                      DeepLIFT                                                  616
ankurtaly/Integrated-Gradients           IntGrad                                                   453
