
1 Related Works

Balduzzi and Ghifary [3] learn a G-function, which predicts the gradients of a Q-function, using a gradient perturbation trick. Fairbank et al. [4] train a G-function with DQN-style training. However, to compute the target gradients, they assume that their model and reward function are differentiable with respect to all inputs. Nguyen and Widrow [12] and Jordan and Jacobs [10] learn a forward model and differentiate through it (see the sketch below). Prokhorov and Wunsch [14] give an overview of heuristic dynamic programming, dual heuristic programming, and globalized dual heuristic programming.
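
To make the model-based line of work concrete, here is a minimal sketch, in PyTorch, of learning a forward model and then differentiating a reward through it to improve a controller. The network sizes, the `reward` function, and the training loop are illustrative assumptions, not details taken from the cited papers.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and networks -- not from the cited works.
obs_dim, act_dim = 4, 2
model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
                      nn.Linear(64, obs_dim))        # learned dynamics: (s, a) -> s'
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                       nn.Linear(64, act_dim))       # controller to be trained

model_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward(state):
    # Assumed differentiable reward, e.g. negative distance to the origin.
    return -(state ** 2).sum(dim=-1)

# 1) Fit the forward model on observed transitions (s, a, s').
def model_step(s, a, s_next):
    loss = ((model(torch.cat([s, a], dim=-1)) - s_next) ** 2).mean()
    model_opt.zero_grad(); loss.backward(); model_opt.step()

# 2) Improve the policy by unrolling the learned model and
#    backpropagating the (differentiable) reward through it.
def policy_step(s0, horizon=10):
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s = model(torch.cat([s, a], dim=-1))
        total = total + reward(s).mean()
    policy_opt.zero_grad(); (-total).backward(); policy_opt.step()
```

The key point is that the policy update never needs an RL estimator: gradients reach the controller only because the model and reward are assumed differentiable, which is exactly the assumption noted above for Fairbank et al. [4].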
Bahdanau et al. [1] use RL to train a Q-function, which is then used to train an RNN. However, this assumes that the target labels are always available. Jaques et al. [9] train a DQN to rate how good the output of an RNN is. Heess et al. [8] train an RNN using DDPG, but do not explore how to apply this approach to supervised learning tasks; as a result, their policy is slow. Hausknecht and Stone [7] train an LSTM Q-function in a DQN setup. However, they assume that either (1) rollouts always start from the beginning of an episode, or (2) rollouts start from a random point in an episode with a zero hidden state (see the sketch after this paragraph). Wierstra and Alexander [16] show how to derive the policy gradient for recurrent policies. Bakker [2] uses an LSTM to represent an advantage function and uses eligibility traces.
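
The two hidden-state conventions described for Hausknecht and Stone [7] can be written down concretely. The following is a minimal sketch, assuming a replay buffer stored as a list of episodes, each a list of (observation, action, reward, done) tuples; the network sizes are illustrative, not theirs.

```python
import random
import torch
import torch.nn as nn

# Hypothetical recurrent Q-network: observation sequence -> Q-values per step.
class RecurrentQ(nn.Module):
    def __init__(self, obs_dim=4, n_actions=3, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, state=None):
        out, state = self.lstm(obs_seq, state)
        return self.head(out), state

q_net = RecurrentQ()

def sample_full_episode(replay):              # strategy (1): replay whole episodes
    episode = random.choice(replay)           # list of (obs, action, reward, done)
    obs = torch.stack([t[0] for t in episode]).unsqueeze(0)
    q_values, _ = q_net(obs)                  # hidden state carried across the episode
    return q_values

def sample_random_window(replay, length=8):   # strategy (2): random window, zero state
    episode = random.choice(replay)
    start = random.randrange(max(1, len(episode) - length))
    obs = torch.stack([t[0] for t in episode[start:start + length]]).unsqueeze(0)
    q_values, _ = q_net(obs, state=None)      # state=None means a zero hidden state
    return q_values
```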
Gomez and Schmidhuber [5] train RNNs with evolutionary algorithms. Schmidhuber et al. [15] train RNNs by randomly guessing weights.
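
As a rough illustration of the random-guessing baseline, the sketch below samples freshly initialized RNNs and keeps the best one on a toy memory task. The task, network sizes, and trial count are assumptions for illustration, not the benchmarks used in [15].

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy benchmark (an assumption): remember the first input bit of a sequence
# and output it at the last step.
seqs = torch.randint(0, 2, (64, 10, 1)).float()   # batch x time x feature
targets = seqs[:, 0, 0]                           # first bit of each sequence

def fitness(rnn, head):
    """Score a randomly initialized RNN on the toy task (higher is better)."""
    with torch.no_grad():
        out, _ = rnn(seqs)                        # out: batch x time x hidden
        pred = torch.sigmoid(head(out[:, -1, :])).squeeze(-1)
        return -nn.functional.binary_cross_entropy(pred, targets).item()

best, best_score = None, float("-inf")
for trial in range(200):                          # "training" = pure random guessing
    rnn = nn.RNN(1, 8, batch_first=True)          # fresh random weights each trial
    head = nn.Linear(8, 1)
    score = fitness(rnn, head)
    if score > best_score:
        best, best_score = (rnn, head), score
```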
Hasinoff [6] provides a survey of earlier work on reinforcement learning with hidden state. Lin and Mitchell [11] explore three different ways to incorporate recurrence into RL: (1) feeding a window of history to the Q-function, (2) a recurrent Q-function, and (3) a recurrent model.
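
The contrast between approaches (1) and (2) can be sketched as follows; the dimensions and architectures are illustrative assumptions rather than details from [11].

```python
import torch
import torch.nn as nn

obs_dim, n_actions, window = 4, 3, 5   # assumed sizes for illustration

# (1) Feed a fixed window of history to a feedforward Q-function.
class WindowQ(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim * window, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs_window):               # obs_window: batch x window x obs_dim
        return self.net(obs_window.flatten(1))   # history is concatenated, no memory

# (2) A recurrent Q-function keeps an internal hidden state instead.
class RecurrentQ(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, 64, batch_first=True)
        self.head = nn.Linear(64, n_actions)

    def forward(self, obs_seq, h=None):          # obs_seq: batch x time x obs_dim
        out, h = self.rnn(obs_seq, h)
        return self.head(out[:, -1]), h          # Q-values at the last step + new state
```

Approach (3) would instead learn a recurrent model of the environment and compute values on top of its predictions.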
One approach [?] augments the MDP to include memory states in the state, observation, and actions (a generic version is sketched below). However, this does not use BPTT information. Peshkin et al. [13] first introduced memory states.
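
As a generic sketch of the memory-state idea (not the exact construction in [13] or the work cited above): the agent's observation is augmented with a small external memory, and its action is augmented with writes to that memory, so a memoryless policy over the augmented spaces can behave like a policy with memory.

```python
import numpy as np

class MemoryAugmentedEnv:
    """Wrap an environment so observations and actions carry explicit memory bits.

    `env` is assumed to expose gym-style reset()/step(action) methods; the
    memory size and encoding are illustrative choices.
    """

    def __init__(self, env, memory_bits=4):
        self.env = env
        self.memory = np.zeros(memory_bits)

    def reset(self):
        obs = self.env.reset()
        self.memory = np.zeros_like(self.memory)
        return np.concatenate([obs, self.memory])     # observation includes memory

    def step(self, action, memory_write):
        # The augmented action is (environment action, new memory contents).
        obs, reward, done, info = self.env.step(action)
        self.memory = np.asarray(memory_write, dtype=float)
        return np.concatenate([obs, self.memory]), reward, done, info
```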

References
[1] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An Actor-Critic Algorithm for Sequence Prediction. arXiv:1607.07086v1 [cs.LG], 2016. URL http://arxiv.org/abs/1607.07086.
[2] Bram Bakker. Reinforcement Learning with Long Short-Term Memory. Advances in Neural Information Processing Systems 14, pages 1475–1482, 2002. ISSN 1049-5258.
[3] David Balduzzi and Muhammad Ghifary. Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies. pages 1–27, 2015. URL http://arxiv.org/abs/1509.03005.
[4] Michael Fairbank, Eduardo Alonso, and Danil Prokhorov. An equivalence between adaptive dynamic programming with a critic and backpropagation through time. IEEE Transactions on Neural Networks and Learning Systems, 24(12):2088–2100, 2013. ISSN 2162-237X. doi: 10.1109/TNNLS.2013.2271778.
[5] F. J. Gomez and J. Schmidhuber. Co-Evolving Recurrent Neurons Learn Deep Memory POMDPs. pages 1–14, 2004. doi: 10.1145/1068009.1068092.
[6] Samuel W. Hasinoff. Reinforcement Learning for Problems with Hidden State. Technical report, University of Toronto, pages 1–18, 2003.
[7] Matthew Hausknecht and Peter Stone. Deep Recurrent Q-Learning for Partially Observable MDPs. 2015.
[8] Nicolas Heess, Jonathan J. Hunt, Timothy P. Lillicrap, and David Silver. Memory-based control with recurrent neural networks. arXiv, pages 1–11, 2015. URL http://arxiv.org/abs/1512.04455.
[9] Natasha Jaques, Shixiang Gu, Richard E. Turner, and Douglas Eck. Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning. pages 1–11, 2016.
[10] Michael I. Jordan and Robert A. Jacobs. Learning to control an unstable system with forward modeling. Advances in Neural Information Processing Systems, 2:324–331, 1990.

[11] L. J. Lin and T. M. Mitchell. Memory approaches to reinforcement learning in non-Markovian domains. Technical report, 1992. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.319.
[12] Derrick H. Nguyen and Bernard Widrow. Neural Networks for Self-Learning Control Systems, 1990. ISSN 0272-1708.

[13] Leonid Peshkin, Nicolas Meuleau, and Leslie Kaelbling. Learning Policies with External Memory. Sixteenth International Conference on Machine Learning, 2001. URL http://arxiv.org/abs/cs/0103003.
[14] Danil V. Prokhorov and Donald C. Wunsch. Adaptive critic designs. IEEE Transactions on Neural Networks, 8(5):997–1007, 1997. ISSN 1045-9227. doi: 10.1109/72.623201.
[15] Jürgen Schmidhuber, Sepp Hochreiter, and Yoshua Bengio. Evaluating Long-term Dependency Benchmark Problems by Random Guessing. 1997.
[16] Daan Wierstra and F. Alexander. Recurrent Policy Gradients. May 2009.
