Chapter_1_Introduction_RL_Report_Kiran

Reinforcement Learning (RL) trains agents to make decisions through interaction with environments, focusing on maximizing cumulative rewards. The document discusses the Deep Q-Network (DQN) algorithm, an advanced method in RL that enhances classical Q-learning using deep neural networks, and outlines its key innovations like Experience Replay and Target Network. The implementation targets the CartPole balancing problem, utilizing DQN's capabilities to effectively manage the exploration-exploitation trade-off in a continuous state space.


1. Introduction

1.1 Overview of Reinforcement Learning


Reinforcement Learning (RL) is a paradigm within machine learning that focuses on
training an agent to make sequential decisions by interacting with an environment. Unlike
supervised learning, where ground-truth labels guide the learning process, RL agents learn
from the consequences of their actions through rewards or penalties. The learning goal is to
maximize the expected cumulative reward over time.

Formally, an RL setting is often modeled as a Markov Decision Process (MDP), which comprises:
- A set of states S
- A set of actions A
- A reward function R(s, a)
- A state transition function P(s'|s, a)
- A discount factor γ ∈ [0,1]

At each discrete time step t, the agent observes a state s_t, selects an action a_t, receives a
reward r_t, and transitions to a new state s_{t+1}. This cycle continues until a terminal state
is reached. The agent's behavior is governed by a policy π(a|s), which maps states to
actions.
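
To make this loop concrete, a minimal interaction sketch using the OpenAI Gym API is shown below. The environment name and the random action choice are placeholders; an actual agent would select actions from its learned policy π(a|s). Note that newer Gymnasium releases use slightly different reset() and step() signatures than the classic Gym API assumed here.

import gym

env = gym.make("CartPole-v1")
state = env.reset()              # classic Gym API: reset() returns the initial observation
done = False
episode_return = 0.0

while not done:
    action = env.action_space.sample()            # placeholder for the agent's policy pi(a|s)
    state, reward, done, info = env.step(action)  # observe r_t and transition to s_{t+1}
    episode_return += reward

print("Episode return:", episode_return)
env.close()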

Reinforcement Learning approaches can be broadly categorized as:
- Value-based methods (e.g., Q-learning)
- Policy-based methods (e.g., REINFORCE)
- Actor-Critic methods

Over time, classical algorithms like Q-learning have evolved into deep reinforcement
learning methods such as Deep Q-Networks (DQN), which use neural networks as function
approximators to handle high-dimensional input spaces. These advances have enabled RL to operate in complex environments where classical tabular methods are infeasible.

1.2 Key Algorithms in RL Used in This Implementation


The selected problem—CartPole balancing—is addressed using the Deep Q-Network (DQN)
algorithm, a foundational method in deep reinforcement learning. DQN is an extension of
the classical Q-learning algorithm, enhanced with deep neural networks to handle
continuous and high-dimensional state spaces.
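
For intuition, the kind of Q-network used for CartPole can be sketched as a small fully connected model that maps the 4-dimensional state to one Q-value per action. The PyTorch definition below is a hypothetical illustration with arbitrary layer widths; the actual implementation relies on the default policy network provided by Stable Baselines3.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a CartPole state vector to one Q-value per discrete action."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per action: Q(s, a; theta)
        )

    def forward(self, state):
        return self.net(state)              # shape: (batch_size, n_actions)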

Q-learning is a model-free, value-based RL algorithm. It seeks to learn the optimal action-value function Q*(s,a), which gives the maximum expected future reward achievable from a
given state-action pair under the optimal policy. The Q-function is updated iteratively using
the Bellman equation:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) ]

where:
- α is the learning rate
- γ is the discount factor
- r_t is the reward at time t
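
A tabular version of this update rule is shown below as a minimal sketch; the table size, learning rate, and discount factor are illustrative values rather than settings from this implementation.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one Bellman update to a tabular Q-function of shape (n_states, n_actions)."""
    # Bootstrap from the next state only if it is not terminal
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])   # move Q(s_t, a_t) toward the TD target
    return Q

# Toy example: 5 states, 2 actions
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)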

DQN enhances Q-learning by approximating the Q-function using a deep neural network
Q(s, a; θ), where θ are the trainable weights. Key innovations that stabilize training in DQN
include:

- Experience Replay: A buffer that stores previous transitions (s, a, r, s'), enabling random
sampling and breaking correlation between sequential data points.
- Target Network: A separate network Q' with fixed weights θ' used to compute the target
Q-value. It is periodically updated to match the main network.
- Epsilon-Greedy Policy: Introduces exploration by selecting a random action with
probability ε and the best-known action with probability 1 - ε.
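
The replay buffer and the epsilon-greedy rule can be sketched roughly as follows. This is a simplified illustration; Stable Baselines3 ships its own optimized implementations of both, and the capacity and batch size shown are arbitrary.

import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions, sampled uniformly at random."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random sampling breaks correlation
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))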

The loss function minimized in DQN is:
L(θ) = E_{(s,a,r,s')} [(r + γ max_{a'} Q(s', a'; θ⁻) - Q(s, a; θ))²]

This formulation ensures that the network learns to approximate the optimal action-value
function over time.
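
A rough PyTorch sketch of this loss on one mini-batch is given below. The network architecture, tensor shapes, and update schedule are assumptions for illustration only and do not reflect the exact Stable Baselines3 internals.

import torch
import torch.nn as nn

def make_q_net():
    # Illustrative network for CartPole: 4 state dimensions, 2 actions
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

q_net = make_q_net()                                  # online network, weights theta
target_net = make_q_net()                             # target network, weights theta_minus
target_net.load_state_dict(q_net.state_dict())

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    """Mean squared TD error: (r + gamma * max_a' Q(s', a'; theta_minus) - Q(s, a; theta))^2."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; theta)
    with torch.no_grad():                                    # targets use the frozen weights
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next           # no bootstrapping at terminal states
    return nn.functional.mse_loss(q_sa, target)

# Periodically (e.g. every few thousand steps) the target network is refreshed:
# target_net.load_state_dict(q_net.state_dict())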

The CartPole environment from OpenAI Gym features:
- A continuous 4-dimensional state space
- A discrete 2-action space (left or right force)
- A reward signal of +1 for every time step the pole remains upright
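
These properties can be verified directly from the environment, as in the short check below (the exact bounds printed depend on the installed Gym version):

import gym

env = gym.make("CartPole-v1")
print(env.observation_space)   # Box with 4 dims: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)        # Discrete(2): push the cart left or right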

Given the continuous nature of the state representation and the simplicity of the action
space, DQN is highly effective. It provides a clear and interpretable case study in balancing
the exploration-exploitation trade-off and understanding value approximation using neural
networks. Additionally, DQN allows for easy integration with the Stable Baselines3 library,
enabling fast and reproducible implementation with visual monitoring tools such as
TensorBoard.
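
A minimal Stable Baselines3 training script along these lines might look as follows. The hyperparameters and the log directory are illustrative placeholders rather than the exact settings used in this report.

from stable_baselines3 import DQN

# Passing the environment id as a string lets Stable Baselines3 create the env itself
model = DQN(
    "MlpPolicy",                 # small fully connected Q-network
    "CartPole-v1",
    learning_rate=1e-3,          # illustrative value
    buffer_size=50_000,          # experience replay capacity
    exploration_fraction=0.1,    # schedule for the epsilon-greedy exploration rate
    verbose=1,
    tensorboard_log="./dqn_cartpole_tb/",
)
model.learn(total_timesteps=50_000)
model.save("dqn_cartpole")

# Training curves can then be inspected with: tensorboard --logdir ./dqn_cartpole_tb/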
