
Propagation-based

explanations
CSEP 590B: Explainable AI
Ian Covert & Su-In Lee
University of Washington



Previously
§ Feature importance explanations
§ Removal-based explanations
§ Shapley values

§ Today: propagation-based explanations



Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Setup
§ Consider a classification model $f(x)$
§ $f_y(x)$ is the probability for class $y$
§ Input is $x \in \mathbb{R}^d$ (must be continuous)
§ Assume $f$ is differentiable (a neural network)

[Diagram: inputs $x_1, x_2, \dots, x_d$ feed into the model $f(\cdot)$, which outputs the prediction $\hat{y}$]


Review: backpropagation
§ Deep learning models are trained using
stochastic gradient descent (SGD)
§ Get a minibatch of examples
§ Calculate predictions and loss
§ Calculate gradients using “backprop” algorithm



Backpropagation

[Diagram: network $x \to h_1 \to h_2 \to \hat{y}$, from the input layer to the output layer, with network weights connecting each layer]


Backpropagation

[Diagram: network $x \to h_1 \to h_2 \to \hat{y}$]

Gradient descent:
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

Network parameters:
$$\theta = \{w_1, b_1, w_2, b_2, w_{\text{out}}, b_{\text{out}}\}$$

Forward pass:
$$h_1 = \sigma(w_1 x + b_1), \quad h_2 = \sigma(w_2 h_1 + b_2), \quad \hat{y} = w_{\text{out}} h_2 + b_{\text{out}}$$
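To make this concrete, here is a minimal PyTorch sketch of the forward pass and one gradient descent step. The layer sizes, the sigmoid choice for $\sigma$, and the cross-entropy loss are illustrative assumptions, not part of the slides.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the two-hidden-layer network above and one SGD step.
# Layer sizes, the sigmoid nonlinearity, and the loss are assumptions.
torch.manual_seed(0)
d, m, k = 10, 32, 3                      # input dim, hidden width, classes

w1 = torch.randn(m, d, requires_grad=True)
b1 = torch.zeros(m, requires_grad=True)
w2 = torch.randn(m, m, requires_grad=True)
b2 = torch.zeros(m, requires_grad=True)
w_out = torch.randn(k, m, requires_grad=True)
b_out = torch.zeros(k, requires_grad=True)
theta = [w1, b1, w2, b2, w_out, b_out]   # network parameters

def forward(x):
    h1 = torch.sigmoid(x @ w1.T + b1)    # h1 = sigma(w1 x + b1)
    h2 = torch.sigmoid(h1 @ w2.T + b2)   # h2 = sigma(w2 h1 + b2)
    return h2 @ w_out.T + b_out          # y_hat = w_out h2 + b_out

# One gradient descent step on a random minibatch.
x, y = torch.randn(8, d), torch.randint(k, (8,))
loss = F.cross_entropy(forward(x), y)    # L(theta)
loss.backward()                          # backprop: grad of loss w.r.t. theta
eta = 0.1
with torch.no_grad():
    for p in theta:
        p -= eta * p.grad                # theta_{t+1} = theta_t - eta * grad
        p.grad = None
```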
Backpropagation (cont.)
§ Model loss is mean prediction error:
$$\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; \theta), y_i\big)$$

§ Gradient calculation:
$$\nabla_\theta \mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell\big(f(x_i; \theta), y_i\big)$$


Chain rule
§ Calculate gradients for all parameters 𝜃 using
chain rule
§ Get gradients for last hidden layer
§ Then for the previous layer
§ Then the layer before that…

§ Backpropagation = propagating gradients


backward through the network

Rumelhart et al., “Learning representations by back-propagating errors” (1986)



Chain rule (cont.)
[Diagram: network $x \to h_1 \to h_2 \to \hat{y}$ with $\mathcal{L}(\theta) = \ell(\hat{y}, y)$; the gradients $\partial\mathcal{L}/\partial h_2$ and $\partial\mathcal{L}/\partial h_1$ propagate backward, yielding $\partial\mathcal{L}/\partial w_{\text{out}}$, $\partial\mathcal{L}/\partial w_2$, $\partial\mathcal{L}/\partial w_1$ and $\partial\mathcal{L}/\partial b_{\text{out}}$, $\partial\mathcal{L}/\partial b_2$, $\partial\mathcal{L}/\partial b_1$]
Propagation-based
explanations

§ Use the backprop idea to quantify feature importance

§ Rather than gradients w.r.t. parameters, calculate gradients w.r.t. inputs


Input gradients
[Diagram: same network as before, with $\mathcal{L}(\theta) = \ell(\hat{y}, y)$, but the chain rule is now extended one step further to compute $\partial\mathcal{L}/\partial x$ in addition to the parameter gradients]
Input gradients
[Diagram: take the gradient of the prediction $f_y(x; \theta) = \hat{y}$ (instead of the loss) and compute it w.r.t. the model input: $\partial f_y / \partial x$]
Intuition
§ Partial derivatives represent sensitivity to small perturbations

§ Mathematically:
$$\frac{\partial f_y}{\partial x_i}(x) = \lim_{\epsilon \to 0} \frac{f_y(x + e_i \cdot \epsilon) - f_y(x)}{\epsilon}$$

§ The numerator is the delta from a small change in the $i$th direction; the limit takes that change to be very small, and it is measured relative to the size of the change
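A quick way to sanity-check this definition is to compare an autograd gradient against the finite-difference quotient. The quadratic `f` below is a hypothetical stand-in for a network output $f_y$:

```python
import torch

# Sanity check: compare the autograd input gradient to the finite-difference
# quotient (f(x + eps * e_i) - f(x)) / eps for one feature.
def f(x):
    return (x ** 2).sum()

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
f(x).backward()                     # gradient w.r.t. the input: 2x

i, eps = 1, 1e-4
e_i = torch.zeros(3)
e_i[i] = 1.0
with torch.no_grad():
    fd = (f(x + eps * e_i) - f(x)) / eps
print(x.grad[i].item(), fd.item())  # both approximately -4.0
```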


Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Application to XAI
§ Idea: find features that cause large output
changes when perturbed

§ Remark: this quantifies feature sensitivity, but is not necessarily related to feature removal


Vanilla gradients
§ For an input $x$ and label $y$, calculate the gradient of the prediction $f_y(x)$:
$$a_i = \frac{\partial f_y}{\partial x_i}(x)$$

§ Can optionally use the absolute value:
$$a_i = \left| \frac{\partial f_y}{\partial x_i}(x) \right|$$

Simonyan et al., “Deep inside convolutional networks: Visualizing image classification models and saliency maps” (2014)
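A minimal sketch in PyTorch, assuming `model` is any differentiable module whose output's last dimension indexes class scores; the helper name and interface are illustrative:

```python
import torch

# Minimal sketch of vanilla gradients for a differentiable classifier.
def vanilla_gradients(model, x, y, absolute=False):
    x = x.clone().detach().requires_grad_(True)
    f_y = model(x)[..., y].sum()    # prediction for class y (summed over batch)
    f_y.backward()                  # gradient w.r.t. the input, not parameters
    return x.grad.abs() if absolute else x.grad
```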


Vanilla gradients (cont.)

[Figure: example saliency maps]


Variation 1: SmoothGrad
§ Average gradients across inputs near 𝑥
§ E.g., add Gaussian noise:
$$a_i = \mathbb{E}\left[ \frac{\partial f_y}{\partial x_i}(x + \epsilon) \right] \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2)$$

§ In practice, use a small number of $\epsilon$ samples (e.g., 50)

§ Must tune $\sigma^2$ to an appropriate level

Smilkov et al., “SmoothGrad: Removing noise by adding noise” (2017)
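A sketch of SmoothGrad reusing the `vanilla_gradients` helper above; the 50-sample default follows the slide's suggestion, and `sigma` must still be tuned:

```python
import torch

# Sketch of SmoothGrad: average vanilla gradients over noisy copies of x.
def smooth_grad(model, x, y, sigma=0.15, n_samples=50):
    total = torch.zeros_like(x)
    for _ in range(n_samples):
        noise = sigma * torch.randn_like(x)        # epsilon ~ N(0, sigma^2)
        total += vanilla_gradients(model, x + noise, y)
    return total / n_samples
```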


SmoothGrad (cont.)

[Figure: standard saliency maps shown alongside varying levels of input noise]


Variation 2: Grad x Input
§ Multiply the gradient by the input values:
$$a_i = x_i \cdot \frac{\partial f_y}{\partial x_i}(x)$$

Shrikumar et al., “Not just a black box: Learning important features through propagating activation differences” (2016)
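Reusing the hypothetical `vanilla_gradients` helper above, Grad x Input is a one-liner:

```python
# Grad x Input: elementwise product of the input with its gradient.
def grad_x_input(model, x, y):
    return x * vanilla_gradients(model, x, y)
```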


Grad x Input (cont.)
§ Interpretation: consider the model’s first-order Taylor expansion around $x_0$:
$$f_y(x) \approx f_y(x_0) + (x - x_0)^\top \frac{\partial f_y}{\partial x}(x_0)$$

§ The gradient gives a linearized version of the model (like replacing a function with its tangent line)
§ Grad x Input approximates the impact of setting the input to zero: taking $x_0 = x$ and evaluating at $0$ gives $f_y(x) - f_y(0) \approx \sum_i x_i \frac{\partial f_y}{\partial x_i}(x)$
§ Similar to occlusion (see previous lecture)


Variation 3: Integrated gradients
§ Gradients can become “saturated”
§ Model is sensitive to big input changes, but not small ones

[Plot: in the saturated region, $f(x)$ is insensitive to small changes in $x$, and the gradient ≈ 0]

§ Saturation can yield small gradients, even for important inputs

Sundararajan et al., “Axiomatic attribution for deep networks” (2017)


IntGrad (cont.)
§ Idea: address the saturation issue by calculating gradients for rescaled images, $\alpha \cdot x$:
$$\frac{\partial f_y}{\partial x_i}(\alpha \cdot x) \quad \text{for } 0 \le \alpha \le 1$$

§ Integrate (average) gradients across the range of rescaled images:
$$\int_{\alpha=0}^{1} \frac{\partial f_y}{\partial x_i}(\alpha \cdot x) \, d\alpha$$

§ Multiply by the input feature value:
$$a_i = x_i \int_{\alpha=0}^{1} \frac{\partial f_y}{\partial x_i}(\alpha \cdot x) \, d\alpha$$


IntGrad (cont.)
§ Implicitly relies on an all-zeros baseline
§ Can instead use a non-zero baseline $x'$:
$$a_i = (x_i - x_i') \int_{\alpha=0}^{1} \frac{\partial f_y}{\partial x_i}\big(x' + \alpha \cdot (x - x')\big) \, d\alpha$$

§ Related to a different idea from cooperative game theory: the Aumann-Shapley value
§ Different from the previous Shapley value
§ Has its own axiomatic derivation (see Sundararajan et al.)


IntGrad (cont.)
§ Problem: the integral is hard to calculate

§ Solution: use a Riemann sum approximation with $m$ regularly spaced values $\alpha_k \in [0, 1]$:
$$a_i \approx (x_i - x_i') \frac{1}{m} \sum_{k=1}^{m} \frac{\partial f_y}{\partial x_i}\big(x' + \alpha_k \cdot (x - x')\big)$$
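A sketch of this approximation, again reusing the `vanilla_gradients` helper; the zeros baseline matches the earlier slide, and $m = 50$ steps is an assumption:

```python
import torch

# Sketch of the Riemann-sum approximation to integrated gradients.
def integrated_gradients(model, x, y, baseline=None, m=50):
    if baseline is None:
        baseline = torch.zeros_like(x)              # x' = 0
    total = torch.zeros_like(x)
    for k in range(1, m + 1):
        alpha = k / m                               # regularly spaced in [0, 1]
        point = baseline + alpha * (x - baseline)   # x' + alpha * (x - x')
        total += vanilla_gradients(model, point, y)
    return (x - baseline) * total / m               # (x - x') times avg gradient
```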


IntGrad (cont.)

[Figure: gradients computed with the image rescaled by $\alpha$]




GradCAM
§ In CNNs, hidden layers represent high-level
visual concepts
§ Hidden layers retain spatial information due to
convolutional structure

§ Idea: explain models via the last convolutional layer instead of the input layer

Selvaraju et al., “Grad-CAM: Visual explanations from deep networks via gradient-based localization” (2017)


CNN receptive fields

§ The receptive field grows at each layer, but remains localized


GradCAM procedure
§ Denote the final layer’s hidden representation as $A \in \mathbb{R}^{w \times h \times c}$
§ Width $w$, height $h$, channels $c$
§ Each channel $k = 1, \dots, c$ is denoted $A^k \in \mathbb{R}^{w \times h}$

§ The final prediction $f_y(x)$ can be viewed as a function of the representation $A$
§ E.g., $A$ → global average pooling → MLP


GradCAM procedure (cont.)

[Figure: the prediction $\hat{y} = f_y(x)$ viewed as a function of the final convolutional representation]


GradCAM procedure (cont.)
§ Calculate gradients w.r.t. $A$:
$$\frac{\partial f_y}{\partial A^k_{ij}} \quad \text{for all } i, j, k$$

§ Average the gradients within each channel:
$$\alpha_k^y = \frac{1}{wh} \sum_{i,j} \frac{\partial f_y}{\partial A^k_{ij}}$$

§ Aggregate the hidden representations across the channel dimension using $\alpha_k^y$:
$$a_{ij} = \sum_{k=1}^{c} \alpha_k^y A^k_{ij}$$


GradCAM procedure (cont.)
§ Often use a thresholding function (suppress negative attributions):
$$a_{ij} = \text{ReLU}\left( \sum_{k=1}^{c} \alpha_k^y A^k_{ij} \right)$$

§ Can optionally upsample the low-resolution scores $a_{ij}$ to the original input size (e.g., bilinear upsampling)
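A sketch of the full GradCAM computation, assuming the final-layer activations `A` and the gradients $\partial f_y / \partial A$ have already been captured (e.g., via forward/backward hooks, which is architecture-dependent and not shown):

```python
import torch
import torch.nn.functional as F

# Sketch of GradCAM given final convolutional activations A (shape [c, h, w],
# channels-first as in PyTorch) and the gradients of f_y w.r.t. A.
def grad_cam(A, grads, output_size):
    alpha = grads.mean(dim=(1, 2))               # average over spatial locations
    cam = (alpha[:, None, None] * A).sum(dim=0)  # weighted sum over channels
    cam = F.relu(cam)                            # suppress negative attributions
    cam = cam[None, None]                        # add batch/channel dims
    return F.interpolate(cam, size=output_size, mode="bilinear",
                         align_corners=False)[0, 0]
```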


GradCAM interpretation
§ The values $\alpha_k^y$ represent the smoothed or averaged gradient of class $y$ w.r.t. channel $k$

§ At each location, the activations $A^k_{ij}$ are multiplied by the averaged gradients and then aggregated

§ Similar to Grad x Input, but using a hidden layer instead of the input layer
§ Like a Taylor approximation of setting internal activations to zero


GradCAM results

[Figure: example GradCAM visualizations]


Other gradient-based methods
§ Guided backprop: Springenberg et al., “Striving for simplicity: The all-convolutional net” (2014)
§ VarGrad: Adebayo et al., “Local explanation methods for deep neural networks lack sensitivity to parameter values” (2018)
§ Expected gradients: Erion et al., “Learning explainable models using attribution priors” (2020)
§ BlurIG: Xu et al., “Attribution in scale and space” (2020)


Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ 10 min break
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Propagation-based
explanations
(continued)
CSEP 590B: Explainable AI
Ian Covert & Su-In Lee
University of Washington



Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Modified backpropagation
§ Previous approaches rely on gradient backprop
§ Others use heuristic backprop variants
§ Calculate “importance” of internal nodes, propagate
back to earlier ones
§ Requires justification for different backprop
heuristics



Layer-wise relevance
propagation (LRP)

§ Intuition: iteratively calculate relevance scores for every layer of the model
§ Start with nodes in the last hidden layer
§ Move backwards through the model by splitting scores in the previous layer

Bach et al., “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation” (2015)


LRP (cont.)
§ Let $R_j^{(l+1)} \in \mathbb{R}$ denote the relevance for the $j$th node in layer $l+1$
§ Initialize $R^{(L)} = f_y(x)$ for the output node of interest
§ Let $R_{i \leftarrow j}^{(l,\, l+1)}$ denote the relevance message sent from the $j$th node in layer $l+1$ to the $i$th node in layer $l$
§ Want the messages to satisfy two conservation properties:

$$R_j^{(l+1)} = \sum_i R_{i \leftarrow j}^{(l,\, l+1)} \quad \text{(summation of outgoing importance)}$$

$$R_i^{(l)} = \sum_j R_{i \leftarrow j}^{(l,\, l+1)} \quad \text{(summation of incoming importance)}$$


LRP (cont.)
§ The previous rules don’t define a unique procedure, so the authors propose multiple options
§ For example, the “$\epsilon$-rule”:
$$R_{i \leftarrow j}^{(l,\, l+1)} = \frac{z_{ij}}{z_j + \epsilon \cdot \text{sign}(z_j)} R_j^{(l+1)}$$

where $z_{ij} = w_{ij}^{(l+1)} h_i^{(l)}$, $z_j = \sum_i z_{ij} + b_j^{(l+1)}$, and $\epsilon > 0$

§ Finally, attributions are given by $a_i = R_i^{(1)}$ for $i = 1, \dots, d$
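A sketch of the $\epsilon$-rule for a single dense layer, following the formulas above; the tensor shapes and function name are assumptions:

```python
import torch

# Sketch of the epsilon-rule for one dense layer: redistribute relevance R
# from layer l+1 back to layer l. Assumed shapes: w is [out, in], h is the
# layer-l activation [in], b is [out], R is [out]; eps > 0 stabilizes small z_j.
def lrp_epsilon(h, w, b, R, eps=1e-6):
    z = w * h                             # z_ij = w_ij * h_i      -> [out, in]
    z_j = z.sum(dim=1) + b                # z_j = sum_i z_ij + b_j -> [out]
    denom = z_j + eps * torch.sign(z_j)
    messages = z * (R / denom)[:, None]   # R_{i<-j} = (z_ij / denom_j) * R_j
    return messages.sum(dim=0)            # R_i = sum_j R_{i<-j}
```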




LRP discussion
§ Arguably less intuitive than other methods; requires some heuristic choices (which “rule” to use?)
§ Can be difficult to adapt to different architectures
§ E.g., does not automatically support residual connections (ResNet architecture), and requires extension to transformers


Other modified backpropagation methods
§ DeepLIFT: Shrikumar et al., “Not just a black box: Learning important features through propagating activation differences” (2016)
§ PatternAttribution: Kindermans et al., “Learning how to explain neural networks: PatternNet and PatternAttribution” (2017)
§ A unifying perspective: Ancona et al., “Towards better understanding of gradient-based attribution methods” (2017)
§ Excitation backprop: Zhang et al., “Top-down neural attention by excitation backprop” (2018)
§ LRP variants: Montavon et al., “Layer-wise relevance propagation: An overview” (2019)


Today
§ Section 1
§ Backprop review
§ Gradient-based explanations
§ Section 2
§ Modified backprop variants
§ Propagation vs. removal-based explanations



Many explanation methods
§ Removal-based explanations
§ SHAP, LIME, RISE, Occlusion, permutation tests

§ Propagation-based explanations
§ SmoothGrad, IntGrad, GradCAM

What should we use in practice?



Model flexibility

What kind of model are you explaining?

§ Removal-based explanations are model-agnostic
§ Can work with any model class (DNNs, trees, etc.)

§ Propagation-based explanations are mainly for neural networks
§ Usually require differentiation
§ Some even have architecture constraints


Data flexibility
What kind of data do you have?

§ Removal-based explanations can handle discrete and continuous features
§ E.g., replace inputs with alternative values from the dataset

§ Propagation-based methods only make sense for continuous features
§ Derivative = sensitivity to a small change
§ Small changes are meaningless for discrete features
§ E.g., $x_i \in \{0, 1, 2\}$


Local or global
What kind of explanation do you need?

§ Both removal- and propagation-based methods can produce local explanations

§ Removal-based methods are better suited for global explanations
§ Can focus on global model behavior (e.g., dataset loss)
§ To use a propagation-based method, we require an aggregation scheme (e.g., mean of local explanations)


Speed
Is speed important?

§ Propagation-based methods are fast
§ Backward pass through the DNN
§ Weak dependence on the number of features

§ Removal-based methods can be slow
§ Often require making predictions with many feature subsets
§ Shapley values are particularly challenging


Quality
Which explanation is most informative or correct?

§ Theory can serve as a guide
§ E.g., Shapley value axioms, IntGrad axioms
§ We can also take an empirical approach
§ Metrics for explanation quality (next lecture)

§ Perspective: no explanation is wrong, but some procedures are misaligned with user questions


Popular methods
Which methods do most people use?

§ A small number of methods dominate
§ Depends on the data domain (tabular, image, NLP)


Tabular data
§ Permutation tests are widely used for global
feature importance

§ SHAP is ubiquitous for local explanations
§ TreeSHAP is built into XGBoost, LGBM
§ KernelSHAP used for other models


Computer vision
§ Gradient-based methods are currently most
popular: GradCAM, IntGrad
§ Removal-based methods are usually too slow
§ Some papers try to fix this, but not popular (yet?)
§ Masking model: Dabkowski & Gal, “Real time image saliency for black box
classifiers” (2017)
§ CXPlain: Schwab & Karlen, “CXPlain: Causal explanations for model
interpretation under uncertainty” (2019)
§ FastSHAP: Jethani et al., “FastSHAP: Real-time Shapley value estimation”
(2021)



NLP
§ NLP models (LSTMs, transformers) can use
most methods
§ Gradient-based methods are popular
§ Removal-based explanations are slower, but leave-
one-out (occlusion) is sometimes used

§ For transformers, some use attention as the explanation
§ Perhaps an interpretable architecture?
§ We’ll return to this in a later lecture


Popular packages

GitHub package                           Description                                               Stars
slundberg/shap                           SHAP variations (KernelSHAP, TreeSHAP, DeepSHAP, etc.)    15.8k
marcotcr/lime                            LIME for images, tabular data                             9.7k
utkuozbulak/pytorch-cnn-visualizations   Various gradient-based methods                            6.4k
jacobgil/pytorch-grad-cam                GradCAM + GradCAM variations                              4.5k
pytorch/captum                           Various gradient-based methods + SHAP                     3.0k
sicara/tf-explain                        Various gradient-based methods + occlusion                906
kundajelab/deeplift                      DeepLIFT                                                  616
ankurtaly/Integrated-Gradients           IntGrad                                                   453
