Some Stuff
Lord Apricot
School of the Dark Arts
Shadow Realm University
The Shadow Realm, TAS, AUSTRALIA, 666
lord.apricot@darkmagic.edu
Abstract
Nothing here.
1 Introduction
Let’s just get to it.
We start with a parameterized distribution $p_\theta(x)$ and a fixed distribution $q(x)$. The cross-entropy between the two is:
$$H(p, q) = -\int_x p_\theta(x) \log q(x)\,dx = -\mathbb{E}_{x \sim p}[\log q(x)]$$
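To make the definition concrete, here is a minimal sketch for the discrete case, where the integral becomes a sum. The three-outcome distributions `p` and `q` are made-up illustrative values, and the function names are hypothetical. It computes the cross-entropy exactly and then as a Monte Carlo average of $-\log q(x)$ over samples drawn from $p$.

```python
import math
import random

def cross_entropy(p, q):
    """Exact H(p, q) = -sum_x p(x) log q(x) for discrete distributions."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0.0)

def cross_entropy_mc(p, q, n_samples=50_000, seed=0):
    """Monte Carlo estimate of -E_{x~p}[log q(x)] from samples of p."""
    rng = random.Random(seed)
    xs = rng.choices(range(len(p)), weights=p, k=n_samples)
    return -sum(math.log(q[x]) for x in xs) / n_samples

# Illustrative values only: p stands in for p_theta(x), q for the fixed q(x).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
exact = cross_entropy(p, q)
estimate = cross_entropy_mc(p, q)
```

The sample average should approach the exact value as `n_samples` grows, which is what makes the expectation form usable when only samples from $p$ are available.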
Our goal is to reduce $H(p, q)$ by updating the parameters of $p_\theta(x)$, which we can do by taking the gradient of $H(p, q)$
and stepping $\theta$ in the direction that minimizes it:
$$\nabla_\theta H(p, q) = -\nabla_\theta \int_x p_\theta(x) \log q(x)\,dx$$
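As a sanity check on this gradient step, here is a minimal sketch assuming a softmax-parameterized categorical $p_\theta$ over three outcomes. The softmax parameterization, the values in `log_q`, and the learning rate are illustrative assumptions and not part of the derivation. The gradient is computed in score-function form, $-\sum_x p_\theta(x) \nabla_\theta \log p_\theta(x) \log q(x)$, and compared against central finite differences.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(theta, log_q):
    """H(p_theta, q) = -sum_x p_theta(x) log q(x)."""
    p = softmax(theta)
    return -sum(px * lq for px, lq in zip(p, log_q))

def grad_cross_entropy(theta, log_q):
    """Score-function form of the gradient of H with respect to theta."""
    p = softmax(theta)
    n = len(theta)
    grad = []
    for k in range(n):
        g = 0.0
        for x in range(n):
            # For softmax logits: d log p(x) / d theta_k = 1[x == k] - p(k).
            dlogp = (1.0 if x == k else 0.0) - p[k]
            g -= p[x] * dlogp * log_q[x]
        grad.append(g)
    return grad

def finite_difference(theta, log_q, h=1e-6):
    """Central-difference approximation, used only as a check."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += h
        down = list(theta)
        down[k] -= h
        grad.append((cross_entropy(up, log_q) - cross_entropy(down, log_q)) / (2 * h))
    return grad

theta = [0.1, -0.3, 0.5]                      # illustrative starting parameters
log_q = [math.log(0.5), math.log(0.3), math.log(0.2)]
analytic = grad_cross_entropy(theta, log_q)
numeric = finite_difference(theta, log_q)

lr = 0.5                                      # assumed step size
theta_next = [t - lr * g for t, g in zip(theta, analytic)]
h_before = cross_entropy(theta, log_q)
h_after = cross_entropy(theta_next, log_q)    # should be lower after one step
```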
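$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{t}^{T} H(p_t, q_t) = -\mathbb{E}_{s,a \sim \pi}\left[\sum_{t}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\,\bigl(Q^\pi(s_t, a_t) - V^\pi(s_t) - \log Z_t\bigr)\right]$$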
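The policy-gradient form above can be tried on a toy problem. The following is a sketch under strong simplifying assumptions: a single-state, three-armed bandit with a softmax policy, known action values `q_values` (made up), $V$ taken as the policy-weighted mean of those values, and $\log Z_t$ set to zero. The expectation is computed exactly by summing over actions rather than by sampling.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def policy_gradient(theta, q_values):
    """Exact -E_{a~pi}[grad log pi(a) * (Q(a) - V)], with log Z assumed 0."""
    pi = softmax(theta)
    v = sum(p * q for p, q in zip(pi, q_values))   # V as the mean of Q under pi
    grad = []
    for k in range(len(theta)):
        g = 0.0
        for a in range(len(theta)):
            dlogpi = (1.0 if a == k else 0.0) - pi[k]
            g -= pi[a] * dlogpi * (q_values[a] - v)
        grad.append(g)
    return grad

def expected_return(theta, q_values):
    return sum(p * q for p, q in zip(softmax(theta), q_values))

q_values = [1.0, 2.0, 4.0]     # hypothetical action values
theta = [0.0, 0.0, 0.0]        # uniform initial policy
initial_return = expected_return(theta, q_values)

for _ in range(200):
    grad = policy_gradient(theta, q_values)
    theta = [t - 0.5 * g for t, g in zip(theta, grad)]   # descend on J

final_policy = softmax(theta)
final_return = expected_return(theta, q_values)
```

Descending this gradient shifts probability mass toward the highest-valued action, so the expected return rises as the objective falls.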
This also implies a form of the policy gradient theorem that minimizes the KL-divergence instead. This can be recovered
using:
$$\nabla_\theta \mathrm{KL}(p \| q) = \mathbb{E}_{x \sim p}[\nabla_\theta \log p_\theta(x)] - \mathbb{E}_{x \sim p}[\nabla_\theta \log p_\theta(x) \log q(x)]$$
Which reduces to:
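$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{t}^{T} \mathrm{KL}(p_t \| q_t) = -\mathbb{E}_{s,a \sim \pi}\left[\sum_{t}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\,\bigl(1 - (Q^\pi(s_t, a_t) - V^\pi(s_t) - \log Z_t)\bigr)\right]$$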
It is also straightforward to show that the pathwise derivative as used in DDPG, TD3, SVG(0), and SAC follows the
same pattern. Using a change of variables:
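$$\nabla_\theta H(p, q) = -\int_x p_\theta(x) \nabla_\theta \log p_\theta(x) \log q(x)\,dx = -\int_\epsilon p(\epsilon) \nabla_x \log q(x(\theta; \epsilon)) \nabla_\theta x(\theta; \epsilon)\,d\epsilon$$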
We can recover the DDPG update by letting $x = a$ such that $a(\theta; \epsilon) = f(\theta) + \epsilon$, where $\epsilon$ is some form of noise
injection (Ornstein-Uhlenbeck in the case of DDPG, $\mathcal{N}(0, \sigma I)$ in the case of TD3 and SVG(0)) and $\log q(x) = Q^\mu(s, a)$
is the off-policy state-action value function:
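$$\nabla_\theta H(p, q) = -\int_{s,\epsilon} p(s)\,p(\epsilon)\,\nabla_a Q^\mu(s, a(\theta; \epsilon))\,\nabla_\theta a(\theta; \epsilon)\,d\epsilon\,ds \approx -\mathbb{E}_{s \sim D, a \sim \pi}\left[\nabla_a Q^\mu(s, a(\theta; \epsilon))\,\nabla_\theta a(\theta; \epsilon)\right]$$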
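The pathwise estimator can be illustrated with a one-dimensional sketch. Everything specific here is an assumption for illustration: the state is dropped (a single fixed state), the critic is a hypothetical quadratic `q_value` with a known derivative, the deterministic part of the policy is $f(\theta) = \theta$, and the noise is Gaussian with standard deviation `sigma`.

```python
import random

def q_value(a):
    """Hypothetical critic standing in for Q^mu(s, a); it peaks at a = 3."""
    return -(a - 3.0) ** 2

def dq_da(a):
    """Derivative of the hypothetical critic with respect to the action."""
    return -2.0 * (a - 3.0)

def pathwise_grad(theta, sigma=0.1, n_samples=20_000, seed=0):
    """Monte Carlo estimate of -E_eps[dQ/da * da/dtheta], a = theta + eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, sigma)
        a = theta + eps           # f(theta) = theta, so da/dtheta = 1
        total += -dq_da(a) * 1.0
    return total / n_samples

theta = 1.0
grad = pathwise_grad(theta)       # analytically 2 * (theta - 3) = -4
theta_next = theta - 0.1 * grad   # one descent step moves theta toward 3
```

Because the noise enters additively, the gradient flows through the action into $\theta$ directly, with no score-function term.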
Again, this implies a form of the loss function that minimizes the KL-divergence instead, which – using the same
identities as used previously – gives us:
As a side-effect of the above, we can choose other objective functions that minimize the same quantities, such as the
mean squared-error loss function. For example, A2C can be trained using the objective:
$$J(\theta) = \mathbb{E}_\pi\left[\sum_{t}^{T-1} \bigl(\log \pi_\theta(a_t|s_t) + V^\pi(s_t) - Q^\pi(s_t, a_t)\bigr)^2\right]$$
Where the un-normalized value estimates are used (yes, this works). This not only reduces the cross-entropy between
the policy and the value function, but it also maximizes the return of the policy, since the expansion of the loss function
takes the form $K_t - \log \pi_\theta(a_t|s_t)(Q^\pi(s_t, a_t) - V^\pi(s_t))$. This also provides a potential insight into what discrete-action
algorithms such as Q-learning are actually doing.
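A small sketch of the squared-error objective may help. The function names and the two-step trajectory values are made up for illustration. It evaluates the loss on a short trajectory and splits each squared term into its three pieces, so the cross term, which is proportional to $-\log \pi_\theta(a_t|s_t)(Q^\pi - V^\pi)$, can be seen directly.

```python
def a2c_squared_error(log_probs, values, q_values):
    """Sum over t of (log pi(a_t|s_t) + V(s_t) - Q(s_t, a_t))^2."""
    return sum((lp + v - q) ** 2
               for lp, v, q in zip(log_probs, values, q_values))

def expanded_terms(lp, v, q):
    """Split one squared term into (log pi)^2, (Q - V)^2, and the cross term."""
    adv = q - v
    return lp ** 2, adv ** 2, -2.0 * lp * adv

# Illustrative two-step trajectory (values are assumptions, not from the text).
log_probs = [-0.5, -1.0]
values = [1.0, 2.0]
q_values = [1.5, 1.0]

loss = a2c_squared_error(log_probs, values, q_values)
pieces = [expanded_terms(lp, v, q)
          for lp, v, q in zip(log_probs, values, q_values)]
reassembled = sum(sum(piece) for piece in pieces)   # equals loss
```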
Q-learning uses the loss:
$$J(\theta) = \mathbb{E}_{s \sim D, a \sim \epsilon}\left[\bigl(Q^\pi(s, a) - (r + \gamma \max_{a'} Q^\mu(s', a'))\bigr)^2\right]$$
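For reference, the squared temporal-difference loss can be sketched in tabular form. The table layout (a list of per-state action-value lists), the separate `q_target` table used for the bootstrap term, and the transition tuple format `(s, a, r, s_next)` are all assumptions made for this illustration.

```python
def q_learning_loss(q_online, q_target, batch, gamma=0.99):
    """Mean of (Q(s, a) - (r + gamma * max_a' Q_target(s', a')))^2 over a batch."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_target[s_next])   # bootstrapped target
        total += (q_online[s][a] - target) ** 2
    return total / len(batch)

# Two states, two actions; the numbers are illustrative only.
q_online = [[1.0, 2.0], [0.5, 0.0]]
q_target = [[1.0, 2.0], [0.5, 0.0]]
batch = [(0, 1, 1.0, 1)]                             # (s, a, r, s_next)
loss = q_learning_loss(q_online, q_target, batch, gamma=0.5)
```

With these numbers the target is $1.0 + 0.5 \cdot 0.5 = 1.25$, so the loss is $(2.0 - 1.25)^2 = 0.5625$.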
Since $Q^\pi(s, a)$ is actually a function $\theta : S \to \mathbb{R}^d$ where $d$ is the dimensionality of the action space, we can think of it
as being $\log p_\theta(a|s)$ for a categorical distribution. Rather than sampling from $p_\theta(a|s)$ directly, we use heuristic noise
injection to select actions. Taking this view, we can re-frame Q-learning as minimizing the loss:
A PREPRINT - MARCH 12, 2020
Where we estimate $V^*(s) \approx \max_a Q^\mu(s, a)$, and make use of the definition of the Q-function. This can be expanded to
get $K - \log p_\theta(a|s) Q^*(s, a)$, which, when we take the gradient with respect to $\theta$, takes the form of a policy gradient
using the score function estimator.
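One way to picture the "Q-values as log-probabilities" view is to normalize a row of Q-values with a log-softmax. This is a sketch under an assumption: the text treats the outputs as $\log p_\theta(a|s)$ directly, whereas the explicit normalization step and the example values in `q_row` are added here purely for illustration.

```python
import math

def log_softmax(q_row):
    """Normalize a row of action values into log-probabilities."""
    m = max(q_row)
    log_z = m + math.log(sum(math.exp(q - m) for q in q_row))
    return [q - log_z for q in q_row]

q_row = [1.0, 2.5, 0.5]        # hypothetical Q(s, .) for one state, d = 3
log_p = log_softmax(q_row)     # read as log p_theta(a | s)
probs = [math.exp(lp) for lp in log_p]

# The greedy action is the same under either reading.
greedy_from_q = max(range(len(q_row)), key=lambda a: q_row[a])
greedy_from_p = max(range(len(probs)), key=lambda a: probs[a])
```

Subtracting the log-normalizer shifts every entry by the same constant, so the ranking of actions is unchanged.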
As a side effect of minimizing the KL-loss function above, we also maximize the entropy (since we minimize the
negative entropy) in the process. This may have benefits for exploration at the cost of taking longer to learn (since we
would expect the agent to take worse actions a greater percentage of the time). This ties in with entropy-regularization
for exploration in RL algorithms – the entropy bonus is actually approximating a KL-divergence loss function.
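The link between the two objectives is the identity $\mathrm{KL}(p \| q) = H(p, q) - H(p)$: minimizing the KL-divergence minimizes the cross-entropy and the negative entropy together. A minimal numerical check for discrete distributions follows. The distributions are illustrative values, not taken from the text.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(px * math.log(px) for px in p if px > 0.0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0.0)

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

p = [0.7, 0.2, 0.1]   # illustrative policy distribution
q = [0.5, 0.3, 0.2]   # illustrative target distribution
kl = kl_divergence(p, q)
decomposed = cross_entropy(p, q) - entropy(p)   # matches kl
```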