Introduction
In reinforcement learning (RL), an agent learns by interacting with an environment. At each discrete time step t, the agent observes a state s_t, selects an action a_t, receives a reward r_t, and transitions to a new state s_{t+1}. This cycle continues until a terminal state is reached. The agent's behavior is governed by a policy π(a|s), which maps each state to a distribution over actions.
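This interaction loop translates almost directly into code. The sketch below uses the Gymnasium API with a random action standing in for the policy π(a|s); CartPole-v1 is an assumed placeholder environment, not one specified in this report.

```python
import gymnasium as gym

# One episode of the agent-environment loop (sketch; CartPole-v1 is a placeholder).
env = gym.make("CartPole-v1")
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()              # a real agent would query pi(a|s) here
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated                  # terminal state or time limit reached
env.close()
```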
Over time, classical tabular algorithms such as Q-learning have evolved into deep reinforcement learning methods such as Deep Q-Networks (DQN), which use neural networks as function approximators to handle high-dimensional input spaces. These advances have enabled RL to operate in complex environments that were previously intractable.
Tabular Q-learning updates its action-value estimates after every transition using the rule

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) ]

where:
- α is the learning rate
- γ is the discount factor
- r_t is the reward received at time t
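As a concrete reference, the sketch below applies this update rule to a single transition; the state/action-space sizes and hyperparameter values are illustrative placeholders, not values used in this report.

```python
import numpy as np

# Tabular Q-learning update (sketch; sizes and hyperparameters are placeholders).
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))      # Q-table initialised to zero

def q_learning_update(s, a, r, s_next, done):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    # No bootstrapping from the next state if the episode has terminated.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```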
DQN enhances Q-learning by approximating the Q-function using a deep neural network
Q(s, a; θ), where θ are the trainable weights. Key innovations that stabilize training in DQN
include:
- Experience Replay: A buffer that stores past transitions (s, a, r, s'), enabling random minibatch sampling and breaking the correlation between sequential data points.
- Target Network: A separate network Q' with frozen weights θ', used to compute the target Q-value; it is periodically updated to match the main network.
- Epsilon-Greedy Policy: Introduces exploration by selecting a random action with probability ε and the best-known action with probability 1 − ε.
Together, these components allow the network to learn a stable approximation of the optimal action-value function over time; a compact sketch of the resulting update step is given below.
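The sketch below combines these pieces into one gradient step, assuming a small PyTorch network; the layer sizes (4 inputs, 2 actions), buffer capacity, and hyperparameters are placeholders rather than settings taken from this report.

```python
import random
from collections import deque
import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, a; theta)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q'(s, a; theta')
target_net.load_state_dict(q_net.state_dict())                             # periodically re-synced
replay_buffer = deque(maxlen=10_000)                                       # stores (s, a, r, s', done)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(batch_size=32):
    """One gradient step on a random minibatch sampled from the replay buffer."""
    batch = random.sample(replay_buffer, batch_size)
    s      = torch.tensor([t[0] for t in batch], dtype=torch.float32)
    a      = torch.tensor([t[1] for t in batch], dtype=torch.int64)
    r      = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s_next = torch.tensor([t[3] for t in batch], dtype=torch.float32)
    done   = torch.tensor([t[4] for t in batch], dtype=torch.float32)
    # TD target r + gamma * max_a' Q'(s', a'), computed with the frozen target network.
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                    # Q(s_t, a_t; theta)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```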
Given the continuous state representation and the simple, discrete action space, DQN is a highly effective choice. It provides a clear and interpretable case study in balancing the exploration-exploitation trade-off and in understanding value approximation with neural networks. In addition, DQN integrates directly with the Stable Baselines3 library, enabling a fast and reproducible implementation with visual monitoring tools such as TensorBoard.
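A minimal training script along these lines (a sketch, assuming Stable Baselines3 with a Gymnasium environment; CartPole-v1, the timestep budget, and the paths are placeholders) might look like:

```python
import gymnasium as gym
from stable_baselines3 import DQN

# Train a DQN agent and log metrics for TensorBoard (placeholder environment and settings).
env = gym.make("CartPole-v1")
model = DQN("MlpPolicy", env, verbose=1, tensorboard_log="./dqn_tensorboard/")
model.learn(total_timesteps=50_000)
model.save("dqn_agent")

# Roll out the trained policy greedily for one episode.
obs, info = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(int(action))  # cast for the discrete action space
    done = terminated or truncated
env.close()
```

Pointing TensorBoard at the log directory (tensorboard --logdir ./dqn_tensorboard/) then shows episode reward and loss curves as training progresses.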